# Re: [HACKERS] improving GROUP BY estimation

Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:

```
It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them.  My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
                          QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
         Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
 count
-------
    32
(1 row)

So, it estimates 850 rows where only 32 are returned. Without
applying your patch, it estimates just 1 row where 32 are returned.
That's an overestimate of roughly 26 times, rather than an
underestimate of 32 times.
```
The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.
In cases like this (with correlated columns), how accurate the estimates are is mostly a matter of luck, no matter which estimator you use. It's pretty easy to produce arbitrarily large errors by breaking the independence assumption, and not just in this particular part of the planner. That won't change until we add some kind of smarts about dependencies between columns.
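Just to illustrate where the 850 comes from: under the independence assumption the filter is treated as picking rows uniformly at random, so the expected number of distinct values follows the textbook "distinct values in a random sample" formula. The correlation means the filter actually picks one narrow slice of `a`. A quick Python sketch (this is the general formula shape, not necessarily the exact expression in the patch):

```python
import math

# Rebuild the data set in miniature: a = round(sqrt(b)), so a is a
# deterministic function of b (strongly correlated columns).
N = 10_000_000

# Rows passing the filter b < 1000 cover only the low end of a.
true_distinct = len({round(math.sqrt(b)) for b in range(1, 1000)})

# Under the independence assumption, selecting n of N rows at random
# from a column with d distinct values yields on average:
#   D(n) = d * (1 - (1 - n/N)^(N/d))
d = round(math.sqrt(N))   # a ranges over ~1..3162 in the full table
n = 1000                  # the planner's row estimate for b < 1000
independent_est = d * (1 - (1 - n / N) ** (N / d))

print(true_distinct)            # 32 - what the query actually returns
print(round(independent_est))   # ~857 - same ballpark as the 850 estimate
```

The formula itself is fine when the columns really are independent; it's the uniform-sampling assumption that the correlated data breaks.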
I think over-estimates are a bit more acceptable in this case, e.g. because of the OOM risk with HashAggregate. Also, I've seen too many nested-loop cascades caused by under-estimates recently, but that's a bit unrelated.
```
As a patch review, I'd say that your patch does what you claim it
does, and it applies cleanly, and passes the regression tests with
my included modifications. I think there needs to be some discussion
on the list about whether the patch is a good idea.
```
Yes, and that's actually why I haven't bothered fixing the regression tests in the patch.
regards

--