On Wed, Jun 6, 2018 at 7:18 PM Claudio Freire <klaussfre...@gmail.com> wrote:
> > > Comparison cost can be approximated probabilistically as:
> > >
> > > cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> > >
> > > Where cost_op_n is the cost of the comparison function for column N,
> > > and ndistinct(col_1_to_n) is an estimation of the number of distinct
> > > values for columns prior to N in the sort order.
> > >
> > > You can approximate ndistinct as the product of ndistinct of previous
> > > columns, or use extended statistics when available.
> > >
> >
> > Sure. The trouble is this also assumes uniform distribution, which is
> > not always the case.
>
> Well, (1.0 / ndistinct) = p(left == right).
>
> If you can compute a better p(left == right) with an MCV, you can get
> a better estimate. If you have an MCV. But that'd be a bit expensive,
> and I'm not sure it's all that relevant.
>
> To what degree does improving this produce better plans? As long as
> average expected cost is relatively well estimated (as in, one sort
> order relative to another sort order), it won't affect the decision.
>
> But if needed, the MCV can be used for this.
>
> So, in essence, you need to account for:
>
> - restrictions on that column that constrain the domain
> - distribution skew (MCV, nulls)
> - ndistinct
>
> To compute p(left == right).
>
> Maybe something as simple as the following?
>
> p_special = sum(mcv_i * mcv_i) + null_frac * null_frac
> p_other = (1 - p_special) * (1 - p_special) / ndistinct(restr)
Well, that came out with a slew of math errors, but the idea is clear: compute p(left == right) from the available statistics, constrained by any applicable restrictions.
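For what it's worth, here is a minimal sketch of the corrected computation (in Python, purely illustrative; none of these names are actual planner symbols). It assumes the per-column statistics (MCV frequencies, null_frac, ndistinct) have already been filtered by whatever restrictions apply, and it fixes the errors above: the residual probability mass should be the unsquared fraction not covered by the MCV list or NULLs, and the residual ndistinct should exclude the MCV entries.

# Illustrative sketch only, not planner code. Fixes the math errors in
# the formulas quoted above.

def p_equal(mcv_freqs, null_frac, ndistinct):
    """Estimate p(left == right) for one column from its statistics.

    mcv_freqs -- frequencies of the most common values (restricted to
                 values that survive any applicable restrictions)
    null_frac -- fraction of NULLs (NULLs tie with each other in a sort)
    ndistinct -- estimated number of distinct non-null values
    """
    # Two random rows both take MCV value v with probability freq(v)^2,
    # and are both NULL with probability null_frac^2.
    p_special = sum(f * f for f in mcv_freqs) + null_frac * null_frac

    # Residual mass: the *unsquared* fraction not covered by the MCV
    # list or NULLs, assumed uniform over the remaining distinct values.
    rest_frac = max(0.0, 1.0 - sum(mcv_freqs) - null_frac)
    rest_distinct = max(ndistinct - len(mcv_freqs), 1)
    return p_special + rest_frac * rest_frac / rest_distinct


def sort_comparison_cost(cost_op, p_eq):
    """Expected cost of comparing two random rows on a multi-column key.

    cost_op -- per-column comparison function costs, in sort-key order
    p_eq    -- per-column p(left == right), e.g. from p_equal()
    """
    total, p_reach = 0.0, 1.0
    for cost, p in zip(cost_op, p_eq):
        # Column N is only compared when all earlier columns tied.
        total += p_reach * cost
        p_reach *= p
    return total

For a two-column key this gives cost_op_1 + p_eq_1 * cost_op_2, which reduces to the quoted cost_comp formula when p(left == right) is taken to be 1/ndistinct of the preceding columns.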