> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnor...@gmail.com> wrote:
> 
> 
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> 
>> wrote:
>> 
>> <snip>
>> 
>> So much better. Clearly, there are cases where this will over-estimate the 
>> cardinality - for example when the values are somehow correlated.
>> 
> 
> I applied your patch, which caused a few regression tests to fail.  Attached
> is a patch that includes the necessary changes to the expected test results.
> 
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a factor similar to that by which the old code
> underestimates them.  My very first test, which was not specifically designed
> to demonstrate this, happens to be one such example:
> 
> 
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                          QUERY PLAN
> ---------------------------------------------------------------
> HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>   Group Key: a
>   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>         Filter: (b < 1000)
> (4 rows)
> 
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
> count 
> -------
>    32
> (1 row)
> 
> 
> 
> So, it estimates 850 rows where only 32 are returned.  Without applying your
> patch, it estimates just 1 row where 32 are returned.  That's an overestimate
> of roughly 26 times, rather than an underestimate of 32 times.
> 
> As a patch review, I'd say that your patch does what you claim it does: it
> applies cleanly and passes the regression tests with my included modifications.
> I think there needs to be some discussion on the list about whether the patch
> is a good idea.
> 
> Mark Dilger
> 
> 
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are
inaccurate, I get better results with a kind of averaging of the old values
from the current code base and the new values from Tomas's patch.  I tried
this, and at least for the few examples we've been throwing around, the
results in the worst cases were not as wildly inaccurate as in either of the
other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
                Assert(rel->reloptkind == RELOPT_BASEREL);
                if (rel->tuples > 0)
                {
+                       double          old_factor;     /* The way it is currently done on master */
+                       double          new_factor;     /* The way Tomas Vondra proposes doing it */
                        /*
                         * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
                         * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
                        /*
                         * Multiply by restriction selectivity.
                         */
-                       reldistinct *= rel->rows / rel->tuples;
+                       old_factor = rel->rows / rel->tuples;                                   /* old way */
+                       new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));    /* Tomas's way */
+
+                       reldistinct *= sqrt(old_factor * new_factor);   /* geometric mean of the old way and the new way */
 
                        /*
                         * Update estimate of total distinct groups.
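
For concreteness, here is a small standalone sketch (not part of the patch)
that plugs rough statistics from the example table into the three formulas.
The rows and tuples values come from the plan above; reldistinct ~= 3162 is
an assumption of roughly what ANALYZE would report as ndistinct for column a:

/*
 * Standalone sketch comparing the three selectivity factors for the
 * example table.  The ndistinct input is an assumption, not taken from
 * pg_stats; build with -lm.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
        double          reldistinct = 3162.0;   /* assumed ndistinct of a, ~sqrt(10000000) */
        double          rows = 1000.0;          /* estimated rows passing "b < 1000" */
        double          tuples = 10000000.0;    /* total rows in t */

        double          old_factor = rows / tuples;
        double          new_factor = 1.0 - pow((reldistinct - 1.0) / reldistinct, rows);
        double          mix_factor = sqrt(old_factor * new_factor);

        printf("old (master):  %6.1f groups\n", reldistinct * old_factor);     /* ~0.3, clamped to 1 */
        printf("new (patch):   %6.1f groups\n", reldistinct * new_factor);     /* ~860 */
        printf("geometric mix: %6.1f groups\n", reldistinct * mix_factor);     /* ~16 */

        return 0;
}

With these assumed inputs the blended factor lands within about a factor of
two of the 32 groups actually returned, versus roughly 26x over for the
patched estimate and 32x under for master.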



