> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them. My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>> QUERY PLAN
>> HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
>> Group Key: a
>> -> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
>> Filter: (b < 1000)
>> (4 rows)
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>> count
>> -------
>> 32
>> (1 row)
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
> The culprit here is that the two columns are not independent, but are (rather
> strongly) correlated due to the way you've generated the data.
Yes, that was intentional. Your formula is exactly correct, so far as I can
tell, for completely uncorrelated data. I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.
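For anyone following along, here's a rough sketch of where the 850 and the
32 come from. This is my own reconstruction in Python, not the patch's code:
I'm assuming the patch uses the standard expected-distinct-values-in-a-sample
formula d * (1 - (1 - n/N)^(N/d)), inferred from the rows=850 estimate rather
than read out of the patch itself.

```python
from math import sqrt

N = 10_000_000  # rows in t
n = 1000        # planner's row estimate for the filter b < 1000
d = 3162        # distinct values of a: sqrt(gs)::int for gs in 1..10^7
                # rounds into 1..3162

# True group count: b < 1000 selects gs in 1..999, and those rows have
# a = round(sqrt(gs)), which covers exactly the integers 1..32.
true_groups = len({round(sqrt(gs)) for gs in range(1, 1000)})
print(true_groups)  # 32, matching the actual query result

# Independence-assumption estimate: treat the n surviving rows as a
# uniform random sample of t and ask how many of the d groups they hit.
est = d * (1 - (1 - n / N) ** (N / d))
print(round(est))  # close to the planner's 850
```

The gap between the two numbers is exactly the correlation: the filter on b
does not sample a uniformly, it picks out a narrow contiguous band of a.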
BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.
> In cases like this (with correlated columns), it's mostly just luck how
> accurate an estimate you get, no matter which of the estimators you use.
> It's pretty easy to generate arbitrary errors by breaking the independence
> assumption, and it's not just this particular place of the planner. And it
> won't change until we add some sort of smartness about dependencies between
> columns.
> I think over-estimates are a bit more acceptable in this case, e.g. because
> of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades
> due to under-estimates recently, but that's a bit unrelated.
I have seen similar problems in systems I maintain, hence my interest
in your patch.
>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
> Yes, that's why I haven't bothered with fixing the regression tests in the
> patch, actually.
Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.
Thanks for the patch submission. I hope my effort to review it is on the
whole more helpful than harmful.
Sent via pgsql-hackers mailing list (email@example.com)