On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > > For reference, here is the test case I was using (which isn't really very > good for > catching dependence between columns): >
And here's a test case with much more dependence between the columns: DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x; SELECT COUNT(DISTINCT a) FROM foo; -- 2 SELECT COUNT(DISTINCT b) FROM foo; -- 5 SELECT COUNT(DISTINCT c) FROM foo; -- 10 SELECT COUNT(DISTINCT d) FROM foo; -- 15 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 150, Actual = 30 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900 -- Estimate with no stats = 1500 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 120, Actual = 30 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 Again, I'd say the current behaviour is pretty good. Regards, Dean