On 28 December 2017 at 01:45, Jeff Janes <jeff.ja...@gmail.com> wrote:
> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.

I agree with this thought.

> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.

It looks like this might even have been the original intention of that code.
Patch looks OK, but I think the comments need minor changes above line 2575.
(A toy sketch of my reading of the proposed rule is at the end of this mail.)

> As it is, you can partially overcome the too short MCV list by cranking up
> the default statistics target, but this is a weak effect and comes at a high
> cost of CPU time. In some of the real distributions I've looked at,
> cranking up default statistics target is almost entirely ineffective.

Agreed, not a good solution.

> [1] Occasionally it will store a much longer MCV list, because no value was
> sampled exactly once, which triggers a different code path in which all seen
> values are put in the MCV and the histogram is NULL. This is not reliable,
> as whether the least-sampled value is present in the sample once or twice is
> pretty brittle.

And we need a better discussion of risk: before, we generated too few MCV
entries. To what extent might we now generate too many? That would be a
problem because of increased planning time.

I have a slight reservation about whether 1.25x is still a sensible
heuristic. Should we add a parameter for that to allow testing during beta?

Marking as Ready For Committer.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
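
(Toy sketch referred to above. This is only my reading of the proposed
rule, not the actual analyze.c code or the patch itself; all names and
numbers are made up for illustration, and the real code applies further
conditions, such as the statistics-target cap, that are omitted here.)

#include <stdio.h>

int
main(void)
{
	/* sample counts of the tracked values, most common first */
	int		track[] = {400, 150, 60, 20, 10, 5, 3, 2};
	int		ntracked = 8;
	int		samplerows = 1000;	/* rows in the sample */
	int		ndistinct = 100;	/* distinct values seen in the sample */

	double	remaining_rows = samplerows;
	int		remaining_distinct = ndistinct;
	int		num_mcv;

	for (num_mcv = 0; num_mcv < ntracked; num_mcv++)
	{
		/*
		 * Average frequency of the values NOT yet taken into the MCV
		 * list.  The old rule compared against 1.25 * samplerows /
		 * ndistinct, i.e. the average over all values, which never
		 * changes as MCVs are selected.
		 */
		double	avgcount = remaining_rows / remaining_distinct;

		if (track[num_mcv] <= 1.25 * avgcount)
			break;				/* not common enough relative to the rest */

		/* deplete the chosen value from the residual counts */
		remaining_rows -= track[num_mcv];
		remaining_distinct--;
	}

	printf("values selected for the MCV list: %d\n", num_mcv);
	return 0;
}

With these made-up counts the depleted-average rule selects 6 values,
while the old fixed cutoff of 1.25 * (1000/100) = 12.5 stops at 4, which
gives some feel for how much longer the list might get.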