On 07/17/2018 11:09 AM, Dean Rasheed wrote:
> On 16 July 2018 at 21:55, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> ...
>>
>> So, how would the proposed algorithm work? Let's start with "a=1":
>> sel(a=1) = 0.1508
>> I don't see much point in applying the two "b" clauses independently (or
>> how would it be done, as it's effectively a single clause):
>> sel(b=1 or b=2) = 0.0673
>> And we get
>> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>> From the multivariate MCV we get
>> mcv_sel = 0.0399
>> And finally
>> total_sel = Max(total_sel, mcv_sel) = 0.0399
>> Which is great, but I don't see how that actually helped anything? We
>> still only have the estimate for the ~7% covered by the MCV list, and
>> not the remaining non-MCV part.
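For reference, the combination above can be spelled out as a short sketch (plain Python, using the selectivity values from the example; the variable names are illustrative, not PostgreSQL internals):

```python
# Sketch of the estimate combination above, with the example's numbers.
sel_a = 0.1508              # sel(a=1), univariate estimate
sel_b = 0.0673              # sel(b=1 or b=2), treated as a single clause
total_sel = sel_a * sel_b   # independence assumption
print(round(total_sel, 4))  # 0.0101

mcv_sel = 0.0399            # estimate from the multivariate MCV list
total_sel = max(total_sel, mcv_sel)
print(round(total_sel, 4))  # 0.0399
```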
> Right. If these are the only stats available, and there are just 2
> top-level clauses like this, it either returns the MCV estimate, or
> the old univariate estimate (whichever is larger). It avoids
> over-estimating, but will almost certainly under-estimate when the MCV
> matches are incomplete.
Yeah :-(
In my experience under-estimates tend to have much worse consequences
(say, a nested loop chosen due to an under-estimate vs. a hash join
chosen due to an over-estimate). This certainly influenced some of the
choices I've made in this patch (the extrapolation to the non-MCV part
is an example of that), but I agree it's not a particularly scientific
approach, and I'd very much like something better.
>> I could do the same thing for the second query, but the problem there is
>> actually exactly the same - extrapolation from MCV to non-MCV part
>> roughly doubles the estimate.
>> So unless I'm applying your algorithm incorrectly, this does not seem
>> like a very promising direction :-(
> You could be right. Actually it's the order dependence with more than
> 2 top-level clauses that bothers me most about this algorithm. It's
> also not entirely obvious how to include histogram stats in this
> scheme.
I think for inequalities that's fairly simple - histograms work pretty
well for that, and I have a hunch that replacing the 0.5 estimate for
partially-matching buckets with convert_to_scalar-like logic and the
geometric mean (as you proposed) will work well enough.
For equalities it's going to be hard. The only thing I can think of at
the moment is checking if there are any matching buckets at all, and
using that to decide whether to extrapolate the MCV selectivity to the
non-MCV part or not (or perhaps to what part of the non-MCV part).
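As a rough illustration of the inequality case, something along these lines (a Python sketch with hypothetical helper names, assuming numeric histogram bounds) is what the convert_to_scalar-like interpolation plus geometric mean might look like for a partially-matched bucket:

```python
from math import prod

def bucket_fraction(lo, hi, value):
    # Fraction of the bucket range [lo, hi) lying below 'value', by
    # linear interpolation -- roughly what convert_to_scalar-like
    # logic would give for numeric bounds. Illustrative only.
    if value <= lo:
        return 0.0
    if value >= hi:
        return 1.0
    return (value - lo) / (hi - lo)

def partial_bucket_match(fractions):
    # Combine the per-clause fractions of a partially-matched bucket
    # with a geometric mean, instead of the flat 0.5 estimate.
    return prod(fractions) ** (1.0 / len(fractions))

# e.g. two clauses covering 50% and 12.5% of a bucket's range:
f = partial_bucket_match([bucket_fraction(0, 10, 5),
                          bucket_fraction(0, 8, 1)])  # 0.25
```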
> A different approach that I have been thinking about is, in
> mcv_update_match_bitmap(), to attempt to work out the probability of
> all the clauses matching and it not being an MCV value. For example,
> given a clause like a=1 whose univariate estimate we know (0.1508 in
> the above example), knowing that the MCV values account for
> 0.0222+0.0177 of that, the remainder must be from non-MCV values. So
> we could say that the probability that a=1 and it not being an MCV
> value is 0.1508-0.0222-0.0177 = 0.1109. So then the question is could
> we combine these non-MCV probabilities to give an estimate of how many
> non-MCV values we need to worry about? I've not fully thought that
> through, but it might be useful.
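Spelling out the arithmetic from that example (plain Python, numbers as quoted above):

```python
# Probability of a=1 *and* the value not being in the MCV list:
# the univariate estimate minus the MCV items matching a=1.
sel_a = 0.1508                  # univariate estimate for a=1
mcv_matches = [0.0222, 0.0177]  # frequencies of MCV items with a=1
non_mcv_sel = sel_a - sum(mcv_matches)
print(round(non_mcv_sel, 4))    # 0.1109
```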
Could we use it to derive some upper boundaries on the non-MCV part?
> The problem is, it's still likely to
> over-estimate when the MCV matches fully cover all possibilities, and
> I still don't see a way to reliably detect that case.
I guess we can use a histogram to limit the over-estimates, as explained
above. It may not be 100% reliable as it depends on the sample and how
exactly the buckets are formed, but it might help.
But perhaps these over-estimates are not a particularly serious issue?
When you already get matches in an MCV list, the number of matching rows
is going to be pretty significant. If you over-estimate 2x, so what?
IMHO that's still a pretty accurate estimate.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services