> On Feb 25, 2016, at 3:16 PM, Mark Dilger <[email protected]> wrote:
>
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <[email protected]>
>> wrote:
>>
>> <snip>
>>
>> So much better. Clearly, there are cases where this will over-estimate the
>> cardinality - for example when the values are somehow correlated.
>>
>
> I applied your patch, which caused a few regression tests to fail. Attached
> is a patch that includes the necessary changes to the expected test results.
>
> It is not hard to write test cases where your patched version
> overestimates the number of rows by a factor very similar to the one by
> which the old code underestimates them. My very first test, which was
> not specifically designed to demonstrate this, happens to be one such
> example:
>
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                            QUERY PLAN
> ---------------------------------------------------------------
>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>    Group Key: a
>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>          Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>  count
> -------
>     32
> (1 row)
>
>
>
> So, it estimates 850 rows where only 32 are returned. Without applying
> your patch, it estimates just 1 row where 32 are returned. That's an
> overestimate of roughly 26 times, rather than an underestimate of 32
> times.
>
> As a patch review, I'd say that your patch does what you claim it does,
> it applies cleanly, and it passes the regression tests with my included
> modifications. I think there needs to be some discussion on the list
> about whether the patch is a good idea.
>
> Mark Dilger
>
>
> <estimate-num-groups-v2.txt>
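For context, here is a back-of-the-envelope check of where those two
estimates plausibly come from. Assume ANALYZE reports roughly 3162
distinct values for a (about sqrt(10,000,000)), and take the estimated
1000 rows passing b < 1000 from the EXPLAIN above:

    old (master):  3162 * (1000 / 10000000)       ~= 0.3, reported as
                   rows=1 since the planner clamps row estimates to at
                   least 1
    new (patched): 3162 * (1 - (3161/3162)^1000)  ~= 860, in the same
                   ballpark as the 850 above

against the 32 distinct values actually returned.
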
Assuming the goal is to minimize the degree to which the estimates are
inaccurate, I get better results with a kind of averaging of the old
values from the current code base and the new values from Tomas's patch.
I tried this, and at least for the few examples we've been throwing
around, the worst-case results are not as wildly inaccurate as with
either of the other two approaches (a rough standalone comparison of the
three estimates follows after the diff):
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
             Assert(rel->reloptkind == RELOPT_BASEREL);
             if (rel->tuples > 0)
             {
+                double old_factor;  /* The way it is currently done on master */
+                double new_factor;  /* The way Tomas Vondra proposes doing it */
                 /*
                  * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
                  * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
                 /*
                  * Multiply by restriction selectivity.
                  */
-                reldistinct *= rel->rows / rel->tuples;
+                old_factor = rel->rows / rel->tuples;  /* old way */
+                new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));  /* Tomas's way */
+
+                reldistinct *= sqrt(old_factor * new_factor);  /* average of old way and new way, sort of */

                 /*
                  * Update estimate of total distinct groups.
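
For illustration, here is a rough standalone sketch (not part of the
patch) that plugs the numbers from the example above into the three
formulas. The ndistinct of 3162 for column a is the same assumption as in
the back-of-the-envelope check above; the 1000 and 10000000 row counts
come from the EXPLAIN output:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* Inputs taken from the quoted example; ndistinct is assumed. */
    double      reldistinct = 3162.0;   /* assumed ndistinct of column a */
    double      rows = 1000.0;          /* estimated rows passing b < 1000 */
    double      tuples = 10000000.0;    /* total rows in table t */

    double      old_factor = rows / tuples;                 /* master's scaling */
    double      new_factor = 1 - powl((reldistinct - 1) / reldistinct, rows);  /* Tomas's patch */
    double      mix_factor = sqrt(old_factor * new_factor); /* geometric mean of the two */

    /* The planner clamps row estimates to at least 1, hence "rows=1" on master. */
    printf("old: %.1f  new: %.1f  mixed: %.1f  (actual distinct values: 32)\n",
           reldistinct * old_factor,
           reldistinct * new_factor,
           reldistinct * mix_factor);
    return 0;
}

Built with something like "cc sketch.c -lm", at least for this one
example the blended estimate comes out within roughly a factor of two of
the actual 32, versus the ~26x overestimate and ~32x underestimate
discussed above.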