> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them. My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>> QUERY PLAN
>> HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
>> Group Key: a
>> -> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
>> Filter: (b < 1000)
>> (4 rows)
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>> count
>> -------
>> 32
>> (1 row)
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
> The culprit here is that the two columns are not independent, but are (rather
> strongly) correlated due to the way you've generated the data.
Yes, that was intentional. Your formula is exactly correct, so far as I can
tell, for completely uncorrelated data. I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.
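For anyone following along, here's a rough sketch of where the 850 and the
32 come from. This is my own reconstruction in Python, not the patch's code:
I'm assuming the patch uses the standard expected-distinct-values-in-a-sample
formula d * (1 - (1 - n/N)^(N/d)), inferred from the rows=850 estimate rather
than read out of the patch itself.

```python
from math import sqrt

N = 10_000_000  # rows in t
n = 1000        # planner's row estimate for the filter b < 1000
d = 3162        # distinct values of a: sqrt(gs)::int for gs in 1..10^7
                # rounds into 1..3162

# True group count: b < 1000 selects gs in 1..999, and those rows have
# a = round(sqrt(gs)), which covers exactly the integers 1..32.
true_groups = len({round(sqrt(gs)) for gs in range(1, 1000)})
print(true_groups)  # 32, matching the actual query result

# Independence-assumption estimate: treat the n surviving rows as a
# uniform random sample of t and ask how many of the d groups they hit.
est = d * (1 - (1 - n / N) ** (N / d))
print(round(est))  # close to the planner's 850
```

The gap between the two numbers is exactly the correlation: the filter on b
does not sample a uniformly, it picks out a narrow contiguous band of a.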
BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.
> In cases like this (with correlated columns), it's mostly just luck how
> accurate an estimate you get, no matter which of the estimators you use.
> It's pretty easy to generate arbitrary errors by breaking the independence
> assumption, and it's not just this particular place of the planner. And it
> won't change until we add some sort of smartness about dependencies between
> columns.
> I think over-estimates are a bit more acceptable in this case, e.g. because
> of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades
> due to under-estimates recently, but that's a bit unrelated.
I have seen similar problems in systems I maintain, hence my interest
in your patch.
>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
> Yes, that's why I haven't bothered with fixing the regression tests in the
> patch, actually.
Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.
Thanks for the patch submission. I hope my effort to review it is on the
whole more helpful than harmful.
Sent via pgsql-hackers mailing list (email@example.com)