On Wed, 17 Mar 2021 at 19:07, Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > On 3/17/21 7:54 PM, Dean Rasheed wrote: > > > > it might have been better to estimate the first case as > > > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > > > and the second case as > > > > ndistinct((a+b)) * ndistinct((c+d)) > > OK. I might be confused, but isn't that what the algorithm currently > does? Or am I just confused about what the first/second case refers to? >
No, it currently estimates the first case as ndistinct((a+b),c) * ndistinct(d). Having said that, maybe that's OK after all. It at least makes an effort to account for any correlation between (a+b) and (c+d), using the known correlation between (a+b) and c. For reference, here is the test case I was using (which isn't really very good for catching dependence between columns): DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,100000) x; SELECT COUNT(DISTINCT a) FROM foo; -- 10 SELECT COUNT(DISTINCT b) FROM foo; -- 11 SELECT COUNT(DISTINCT c) FROM foo; -- 12 SELECT COUNT(DISTINCT d) FROM foo; -- 13 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 2964, Actual = 478 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 480, Actual = 478 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24 I think that's probably pretty reasonable behaviour, given incomplete stats (the estimate with no extended stats is capped at 10000). Regards, Dean