> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them. My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>> QUERY PLAN
>> HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
>> Group Key: a
>> -> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
>> Filter: (b < 1000)
>> (4 rows)
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>> count
>> -------
>> 32
>> (1 row)
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
> The culprit here is that the two columns are not independent, but are (rather
> strongly) correlated due to the way you've generated the data.
Yes, that was intentional. Your formula is exactly correct, so far as I can
tell, for completely uncorrelated data. I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.
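For anyone following along, here's a rough sketch of where the 850 and the
32 come from. This is my own reconstruction in Python, not the patch's code:
I'm assuming the patch uses the standard expected-distinct-values-in-a-sample
formula d * (1 - (1 - n/N)^(N/d)), inferred from the rows=850 estimate rather
than read out of the patch itself.

```python
from math import sqrt

N = 10_000_000  # rows in t
n = 1000        # planner's row estimate for the filter b < 1000
d = 3162        # distinct values of a: sqrt(gs)::int for gs in 1..10^7
                # rounds into 1..3162

# True group count: b < 1000 selects gs in 1..999, and those rows have
# a = round(sqrt(gs)), which covers exactly the integers 1..32.
true_groups = len({round(sqrt(gs)) for gs in range(1, 1000)})
print(true_groups)  # 32, matching the actual query result

# Independence-assumption estimate: treat the n surviving rows as a
# uniform random sample of t and ask how many of the d groups they hit.
est = d * (1 - (1 - n / N) ** (N / d))
print(round(est))  # close to the planner's 850
```

The gap between the two numbers is exactly the correlation: the filter on b
does not sample a uniformly, it picks out a narrow contiguous band of a.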
BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.
> In cases like this (with correlated columns), it's mostly just luck how
> accurate an estimate you get, no matter which of the estimators you use.
> It's pretty easy to generate arbitrary errors by breaking the independence
> assumption, and it's not just this particular place of the planner. And it
> won't change until we add some sort of smartness about dependencies between
> columns.
> I think over-estimates are a bit more acceptable in this case, e.g. because
> of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades
> due to under-estimates recently, but that's a bit unrelated.
I have seen similar problems in systems I maintain, hence my interest
in your patch.
>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
> Yes, that's why I haven't bothered with fixing the regression tests in the
> patch, actually.
Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.
Thanks for the patch submission. I hope my effort to review it is on the
whole more helpful than harmful.
Sent via pgsql-hackers mailing list (email@example.com)