Re: [HACKERS] More stable query plans via more predictable column statistics

Alex Shulgin Sat, 02 Apr 2016 18:45:14 -0700

On Sat, Apr 2, 2016 at 8:57 PM, Shulgin, Oleksandr <
[email protected]> wrote:
> On Apr 2, 2016 18:38, "Tom Lane" <[email protected]> wrote:
>
>> I did not like the fact that the compute_scalar_stats logic
>> would allow absolutely anything into the MCV list once num_hist falls
>> below 2. I think it's important that we continue to reject values that
>> are only seen once in the sample, because there's no very good reason to
>> think that they are MCVs and not just infrequent values that by luck
>> appeared in the sample.
>
> In my understanding we only put a value in the track list if we've seen it
> at least twice, no?


This is actually the case for compute_scalar_stats, but not for
compute_distinct_stats.  In the latter case we can still have
track[i].count == 1, but we can also break out of the loop if we see the
first tracked item like that.

>> Before I noticed the regression failure, I'd been thinking that maybe
it'd
>> be better if the decision rule were not "at least 100+x% of the average
>> frequency of this value and later ones", but "at least 100+x% of the
>> average frequency of values after this one".
>
> Hm, sounds pretty similar to what I wanted to achieve, but better
> formalized.
>
>> With that formulation, we're
>> not constrained as to the range of x.  Now, if there are *no* values
after
>> this one, then this way needs an explicit special case in order not to
>> compute 0/0; but the preceding para shows that we need a special case for
>> the last value anyway.
>>
>> So, attached is a patch rewritten along these lines.  I used 50% rather
>> than 25% as the new cutoff percentage --- obviously it should be higher
>> in this formulation than before, but I have no idea if that particular
>> number is good or we should use something else.  Also, the rule for the
>> last value is "at least 1% of the non-null samples".  That's a pure guess
>> as well.
>>
>> I do not have any good corpuses of data to try this on.  Can folks who
>> have been following this thread try it on their data and see how it
>> does?  Also please try some other multipliers besides 1.5, so we can
>> get a feeling for where that cutoff should be placed.
>
> Expect me to run it on my pet db early next week. :-)

I was trying to come up with some examples where 50% could be a good or a
bad choice and then I noticed that we might be able to turn it it the other
way round: instead of inventing an arbitrary limit at the MCVs frequency we
could use the histogram as the criteria for a candidate MCV to be
considered "common enough".  If we can prove that the value would produce
duplicates in the histogram, we should rather put it in the MCV list
(unless it's already fully occupied, then we can't do anything).

A value is guaranteed to produce a duplicate if it has appeared at least
2*hfreq+1 times in the sample (hfreq from your version of the patch, which
is recalculated on every loop iteration).  I could produce an updated patch
on Monday or anyone else following this discussion should be able to do
that.

This approach would be a huge win in my opinion, because this way we can
avoid all the arbitrariness of that .25 / .50 multiplier.  Otherwise there
might be (valid) complaints that for some data .40 (or .60) is a better
fit, but we have already hard-coded something and there would be no easy
way to improve situation for some users while avoiding to break it for the
rest (unless we introduce a per-attribute configurable parameter like
statistics_target for this multiplier, which I'd like to avoid even
thinking about ;-)

While we don't (well, can't) build a histogram in the
compute_distinct_stats variant, we could also apply the above mind trick
there for the same reason and to make the output of both functions more
consistent (and to have less maintenance burden between the variants).  And
anyway it would be rather surprising to see that depending on the presence
of an order operator for the type, the resulting MCV lists after the
ANALYZE would be different (I mean not only due to the random nature of the
sample).

I'm not sure yet about the 1% rule for the last value, but would also love
to see if we can avoid the arbitrary limit here.  What happens with a last
value which is less than 1% popular in the current code anyway?

Cheers!
--
Alex

Re: [HACKERS] More stable query plans via more predictable column statistics

Reply via email to