On 07/17/2018 11:09 AM, Dean Rasheed wrote:
On 16 July 2018 at 21:55, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

...
So, how would the proposed algorithm work? Let's start with "a=1":

    sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (nor
is it clear how that would be done, as it's effectively a single clause):

    sel(b=1 or b=2) = 0.0673

And we get

    total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

From the multivariate MCV we get

    mcv_sel = 0.0399

And finally

    total_sel = Max(total_sel, mcv_sel) = 0.0399

Which is great, but I don't see how that actually helped anything? We
still only have the estimate for the ~7% covered by the MCV list, and
not the remaining non-MCV part.


Right. If these are the only stats available, and there are just 2
top-level clauses like this, it either returns the MCV estimate, or
the old univariate estimate (whichever is larger). It avoids
over-estimating, but will almost certainly under-estimate when the MCV
matches are incomplete.


Yeah :-(

In my experience under-estimates tend to have much worse consequences than over-estimates (say, a nested loop chosen because of an under-estimate vs. a hash join chosen because of an over-estimate). This certainly influenced some of the choices I've made in this patch (extrapolation to the non-MCV part is an example of that), but I agree it's not a particularly scientific approach and I'd very much like something better.
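
To illustrate the idea, here is a simplified standalone sketch of such extrapolation (not the actual patch code - the function name and the numbers are made up just for this example):

    #include <stdio.h>

    /*
     * Toy illustration of extrapolating the MCV-based selectivity to the
     * non-MCV part: assume rows not covered by the MCV list match the
     * clauses at the same rate as the rows that are covered.
     */
    static double
    extrapolate_to_non_mcv(double mcv_sel,       /* selectivity of matching MCV items */
                           double mcv_totalfreq) /* total frequency of all MCV items */
    {
        double match_rate = mcv_sel / mcv_totalfreq; /* match rate within the MCV part */
        double non_mcv_freq = 1.0 - mcv_totalfreq;   /* frequency of the non-MCV part */

        /* MCV part estimated directly, non-MCV part extrapolated */
        return mcv_sel + match_rate * non_mcv_freq;
    }

    int
    main(void)
    {
        /* e.g. MCV items covering 50% of the table, 4% of rows matching */
        printf("estimate = %f\n", extrapolate_to_non_mcv(0.04, 0.5));
        return 0;
    }

With the MCV items covering half the table this roughly doubles the MCV-based estimate, and it is exactly this step that over-shoots when the MCV list in fact covers all the matching values.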


I could do the same thing for the second query, but the problem there is
actually exactly the same - extrapolation from MCV to non-MCV part
roughly doubles the estimate.

So unless I'm applying your algorithm incorrectly, this does not seem
like a very promising direction :-(


You could be right. Actually it's the order dependence with more than
2 top-level clauses that bothers me most about this algorithm. It's
also not entirely obvious how to include histogram stats in this
scheme.


I think for inequalities that's fairly simple - histograms work pretty well for that, and I have a hunch that replacing the 0.5 estimate for partially-matching buckets with convert_to_scalar-like logic and the geometric mean (as you proposed) will work well enough.
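
Roughly what I have in mind, as a toy sketch (not the patch code; the bucket boundaries and constants are invented):

    #include <math.h>
    #include <stdio.h>

    /*
     * Estimate what fraction of a histogram bucket [lo, hi) satisfies
     * "x < value" by linear interpolation (convert_to_scalar-style),
     * instead of using a flat 0.5 for every partially matched bucket.
     */
    static double
    bucket_fraction_lt(double lo, double hi, double value)
    {
        if (value <= lo)
            return 0.0;
        if (value >= hi)
            return 1.0;
        return (value - lo) / (hi - lo);
    }

    int
    main(void)
    {
        /* fractions of one multivariate bucket matched by two inequalities */
        double frac_a = bucket_fraction_lt(0.0, 10.0, 2.5);      /* a < 2.5 -> 0.25 */
        double frac_b = bucket_fraction_lt(100.0, 200.0, 150.0); /* b < 150 -> 0.50 */

        /* combine the per-clause fractions using the geometric mean */
        double bucket_frac = sqrt(frac_a * frac_b);

        printf("bucket fraction = %f\n", bucket_frac);
        return 0;
    }

The bucket's frequency would then be multiplied by that fraction instead of by 0.5.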

For equalities it's going to be hard. The only thing I can think of at the moment is checking whether there are any matching histogram buckets at all, and using that to decide whether to extrapolate the MCV selectivity to the non-MCV part or not (or perhaps to decide to what part of the non-MCV part to extrapolate).
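
Something along these lines, perhaps (again a made-up standalone sketch, not code from the patch):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * For an equality clause, check whether any histogram bucket [lo, hi]
     * could contain the value at all.  If not, the non-MCV part cannot
     * contribute anything, so we should not extrapolate the MCV
     * selectivity to it.
     */
    static bool
    value_possibly_in_histogram(const double *lo, const double *hi,
                                int nbuckets, double value)
    {
        for (int i = 0; i < nbuckets; i++)
        {
            if (value >= lo[i] && value <= hi[i])
                return true;
        }
        return false;
    }

    int
    main(void)
    {
        double lo[] = {0.0, 10.0, 20.0};
        double hi[] = {10.0, 20.0, 30.0};

        double mcv_sel = 0.04;      /* made-up estimate from the MCV list */
        double extrapolated = 0.08; /* made-up estimate with extrapolation */

        double sel = value_possibly_in_histogram(lo, hi, 3, 42.0)
            ? extrapolated          /* the value may hide in the non-MCV part */
            : mcv_sel;              /* the MCV list has to cover all matches */

        printf("sel = %f\n", sel);
        return 0;
    }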

A different approach that I have been thinking about is to attempt,
in mcv_update_match_bitmap(), to work out the probability of all
the clauses matching and the value not being an MCV. For example, given
a clause like a=1 whose univariate estimate we know (0.1508 in the
above example), knowing that the MCV values account for 0.0222+0.0177
of that, the remainder must be from non-MCV values. So we could say
that the probability that a=1 matches and it is not an MCV value is
0.1508-0.0222-0.0177 = 0.1109. So then the question is could we
combine these non-MCV probabilities to give an estimate of how many
non-MCV values we need to worry about? I've not fully thought that
through, but it might be useful.

Could we use it to derive some upper bounds on the non-MCV part?
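
For example, I could imagine something like this (pure speculation on my part, not anything the patch currently does; the MCV contribution of the second clause is invented):

    #include <stdio.h>

    /*
     * The non-MCV part of the conjunction cannot be larger than the
     * smallest per-clause "univariate minus MCV" remainder, because
     * every matching non-MCV row has to satisfy all of the clauses.
     */
    static double
    non_mcv_upper_bound(const double *univariate_sel, /* per-clause estimates */
                        const double *mcv_sel,        /* per-clause MCV matches */
                        int nclauses)
    {
        double bound = 1.0;

        for (int i = 0; i < nclauses; i++)
        {
            double remainder = univariate_sel[i] - mcv_sel[i];

            if (remainder < 0.0)
                remainder = 0.0;
            if (remainder < bound)
                bound = remainder;
        }

        return bound;
    }

    int
    main(void)
    {
        /* a=1 numbers from your example, invented ones for (b=1 or b=2) */
        double univariate_sel[] = {0.1508, 0.0673};
        double mcv_sel[] = {0.0222 + 0.0177, 0.0100};

        printf("non-MCV upper bound = %f\n",
               non_mcv_upper_bound(univariate_sel, mcv_sel, 2));
        return 0;
    }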

The problem is, it's still likely to
over-estimate when the MCV matches fully cover all possibilities, and
I still don't see a way to reliably detect that case.


I guess we can use a histogram to limit the over-estimates, as explained above. It may not be 100% reliable as it depends on the sample and how exactly the buckets are formed, but it might help.
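
The limiting might look something like this (a toy sketch; histogram_cap stands for whatever upper bound we manage to derive from the histogram, and the numbers are made up):

    #include <stdio.h>

    /* Cap the extrapolated non-MCV estimate by a histogram-derived bound. */
    static double
    clamp_estimate(double mcv_sel,       /* selectivity of matching MCV items */
                   double extrapolated,  /* extrapolated non-MCV selectivity */
                   double histogram_cap) /* upper bound from the histogram */
    {
        double non_mcv = extrapolated;

        if (non_mcv > histogram_cap)
            non_mcv = histogram_cap;

        return mcv_sel + non_mcv;
    }

    int
    main(void)
    {
        /*
         * Made-up numbers: MCV part 4%, extrapolation adds another 4%,
         * but the histogram suggests at most 1% can match outside the MCV.
         */
        printf("sel = %f\n", clamp_estimate(0.04, 0.04, 0.01));
        return 0;
    }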

But are these over-estimates really such a serious issue? When you already get matches in the MCV list, the number of matching rows is going to be pretty significant. If you over-estimate 2x, so what? IMHO that's still a pretty accurate estimate.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
