Re: cost_sort() improvements

Tomas Vondra Tue, 10 Jul 2018 13:57:36 -0700

On 06/28/2018 06:47 PM, Teodor Sigaev wrote:
> Hi!
>
> ...
>
>  - cost for i-th columns:
>    Ci * log2(NGi) => Ci * log2(N/G(i-1))
>    So, here I believe, that i-th comparison function will be called only
> inside
>    group which number is defined by (i-1) leading columns. Note, following
>    discussion [1] I add multiplier 1.5 here to be closer to worst case when
>    groups are significantly differ. Right now there is no way to
> determine how
>    differ they could be. Note, this formula describes cost of first
> column too
>    except difference with multiplier.


One more thought about estimating the worst case - I wonder if simply
multiplying the per-tuple cost by 1.5 is the right approach. It does not
seem particularly principled, and it's trivial simple to construct
counter-examples defeating it (imagine columns with 99% of the rows
having the same value, and then many values in exactly one row - that
will defeat the ndistinct-based group size estimates).

The reason why constructing such counter-examples is so simple is that
this relies entirely on the ndistinct stats, ignoring the other stats we
already have. I think this might leverage the per-column MCV lists (and
eventually multi-column MCV lists, if that ever gets committed).

We're estimating the number of tuples in group (after fixing values in
the preceding columns), because that's what determines the number of
comparisons we need to make at a given stage. The patch does this by
estimating the average group size, by estimating ndistinct of the
preceding columns G(i-1), and computing ntuples/G(i-1).

But consider that we also have MCV lists for each column - if there is a
very common value, it should be there. So let's say M(i) is a frequency
of the most common value in i-th column, determined either from the MCV
list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)
for the fist i columns, then the largest possible group of values when
comparing values in i-th column is

    N * m(i-1)

This seems like a better way to estimate the worst case. I don't know if
this should be used instead of NG(i), or if those two estimates should
be combined in some way.

I think estimating the largest group we need to sort should be helpful
for the incremental sort patch, so I'm adding Alexander to this thread.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: cost_sort() improvements

Reply via email to