Re: POC: GROUP BY optimization

Claudio Freire Wed, 06 Jun 2018 14:23:33 -0700

On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
> >>>> For example, it seems to disregard that different data types have
> >>>> different comparison costs. For example comparing bytea will be far
> >>>> more expensive compared to int4, so it may be much more efficient to
> >>>> compare int4 columns first, even if there are far fewer distinct
> >>>> values in them.
> >>> as I can see cost_sort() doesn't pay attention to this details. And
> >>> it should be a separated patch to improve that.
> >>>
> >>
> >> But sort also does not reorder columns.
> > Yes. But estimation of cost of comparing function/number of unique
> > values in column could be not very accurate and so planner could make
> > a wrong choice.


...
>
> >> Imagine you have a custom data type that is expensive for comparisons.
> >> You know that, so you place it at the end of GROUP BY clauses, to
> >> reduce the number of comparisons on that field. And then we come along
> >> and just reorder the columns, placing it first, because it happens to
> >> have a high ndistinct statistic. And there's no way to get the
> >> original behavior :-(
> > Hm. that it could be, but I don't know how to improve here.  Current
> > cost_sort() will return the same cost for any columns order.
> >
> > Okay, here we know an estimation of nrow, we could somehow find a
> > comparing function oid and find a its procost field. And then? we also
> > need to know width of tuple here and mostly repeat the cost_sort.
> >
> > Another option is an introducing enable_groupby_reorder GUC variable
> > although personally I don't like an idea to add new GUC variable.
> >
>
> I don't like the idea of yet another GUC either, as it's rather blunt
> instrument. Which is why I'm suggesting to split the patch into two
> parts. Then we can apply the index stuff (which seems relatively
> straightforward) and keep working on this second part.
>
> FWIW I think it would be useful to have "development GUC" that would
> allow us to enable/disable these options during development, because
> that makes experiments much easier. But then remove them before commit.

Correct me if I'm wrong, but doesn't this patch concern itself with
precisely sort performance?

As such, estimating sort performance correctly in the various plan
variants being considered seems to be a very central aspect of it.

This means it's pretty much time to consider the effect of operator
cost *and* distinct values in the cost computation.

Assuming cost_sort does its thing, it's just a matter of building the
desired variants and picking the one with the smallest cost_sort. One
can use heuristics to build a subset of all possible column orderings
with a guiding principle that tries to maximize the likelihook of
finding a better order than the one in the query (like sorting by
ndistinct), but I'd suggest:

- Always considering the sort order verbatim as given in the SQL (ie:
what the user requests)
- Relying on cost_sort to distinguish among them, and pick a winner, and
- When in doubt (if cost_sort doesn't produce a clear winner), use the
order given by the user

Comparison cost can be approximated probabilistically as:

cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))

Where cost_op_n is the cost of the comparison function for column N,
and ndistinct(col_1_to_n) is an estimation of the number of distinct
values for columns prior to N in the sort order.

You can approximate ndistinct as the product of ndistinct of previous
columns, or use extended statistics when available.

I think that should give a good enough approximation of
ndistinct-enriched sort costs that considers operator cost
appropriately. That operator cost is basically an average and can be
used as a constant, so it's just a matter of passing the right
comparison_cost to cost_sort.

Even if ndistinct estimates are off, the fact that operator cost is
involved should be a good enough tool for the user should the planner
pick the wrong path - it's only a matter of boosting operator cost to
let the planner know that sorting that way is expensive.

Priorization of the user-provided order can be as simple as giving
that comparison_cost a small handicap.

Re: POC: GROUP BY optimization

Reply via email to