Re: [HACKERS] improving GROUP BY estimation

Mark Dilger Thu, 03 Mar 2016 09:41:29 -0800

> On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> 
> Hi,
> 
> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>> Hi, Tomas!
>> 
>> I've assigned to review this patch.
>> 
>> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
>> It applies to head cleanly, passes corrected regression tests.
>> 
>> About correlated/uncorrelated cases. I think now optimizer mostly assumes
>> all distributions to be independent. I think we should follow this
>> assumption in this case also until we have fundamentally better option (like
>> your multivariate statistics).
>> 
>> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List 
>> *groupExprs,
>> double input_rows,
>> reldistinct = clamp;
>> 
>> /*
>> -* Multiply by restriction selectivity.
>> +* Estimate number of distinct values expected in given number of rows.
>> */
>> -reldistinct *= rel->rows / rel->tuples;
>> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>> 
>> /*
>> * Update estimate of total distinct groups.
>> 
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like formula which appears from nowhere.
> 
> I thought the details (particularly the link to stackexchange, discussing the 
> formula) would be enough, but let me elaborate.
> 
> The current formula
> 
>    reldistinct *= rel->rows / rel->tuples;
> 
> simply multiplies the ndistinct estimate with selectivity. So when you select 
> 1% of the table, the estimate says we'll see 1% of ndistinct values. But 
> clearly that's naive, because for example when you have 10k distinct values 
> and you select 10M rows, you should expect nearly all of them in the sample.


The current formula assumes total dependence between the
columns.  You can create tables where this is true, and the
formula gives precisely the right estimate.  I'm not claiming this
is common, or that it should be the default assumption, but it
is one end of the spectrum of possible correct estimates.

> And that's what the formula does - it gives you the expected number of 
> distinct values, when selecting 'k' values from 'd' distinct values with 
> replacements:
> 
>    count(k, d) = d * (1 - ((d-1)/d) ^ k)
> 
> It's assuming the distinct values are equally probable (uniform distribution) 
> and that the probabilities do not change (that's the "with replacement").

The new formula assumes total independence between the
columns.  You can likewise create tables where this is true,
and you did so upthread, and for those tables it gives precisely
the right estimate.  This is the other end of the spectrum of
possible correct estimates.

In the absence of multivariate statistics, either because you
haven't committed that work yet, or because the multivariate
data the planner needs hasn't been collected, choosing an
estimate exactly in the middle of the spectrum (whatever
that would mean mathematically) would minimize the
maximum possible error in the estimate.

I have the sense that my point of view is in the minority, because
the expectation is the problems caused by independence
assumptions will only be fixed when multivariate statistics are
implemented and available, and so we should just keep the
independence assumptions everywhere until that time.  I
tend to disagree with that, on the grounds that even when that
work is finished, we'll never have complete coverage of every
possible set of columns and what their degree of dependence
is.

Perhaps I am badly mistaken.

Can somebody more familiar with the issues than I am please
disabuse me of my wrongheadedness?

Mark Dilger



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] improving GROUP BY estimation

Reply via email to