On 06/06/17 10:12, Gavin Flower wrote:

On 06/06/17 09:41, Mark Kirkwood wrote:

On 05/06/17 09:30, Tom Lane wrote:

I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about it.

First, I think we need a larger hard floor on the number of occurrences of a value that're required to make ANALYZE decide it is a "most common value". The existing coding is willing to believe that anything that appears at least twice in the sample is a potential MCV, but that design originated when we were envisioning stats samples of just a few thousand rows --- specifically, default_statistics_target was originally just 10, leading to a 3000-row sample size. So accepting two-appearance values as MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency estimate of 1/1000. With today's typical sample size of 30000 rows, a value would have to appear at least 30 times in the sample to be believed to be an MCV. That seems like it gives us a reasonable margin of error against the kind of sampling noise seen in the above-cited thread.

Second, the code also has a rule that potential MCVs need to have an estimated frequency at least 25% larger than what it thinks the "average" value's frequency is. A rule of that general form seems like a good idea, but I now think the 25% threshold is far too small to do anything useful. In particular, in any case like this where there are more distinct values than there are sample rows, the "average frequency" estimate will correspond to less than one occurrence in the sample, so that this rule is totally useless to filter anything that we would otherwise consider as an MCV.
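To make the proposed hard floor concrete, here is a minimal Python sketch; the function name and shape are mine for illustration, not the actual analyze.c code:

```python
import math

def mcv_min_count(sample_rows, min_freq=1.0 / 1000):
    # Minimum number of occurrences in the sample for a value to be
    # believed an MCV under a hard frequency floor of min_freq.
    # (Illustrative sketch only -- not the actual analyze.c logic.)
    return max(2, math.ceil(sample_rows * min_freq))

# Today's typical 30000-row sample with the proposed 1/1000 floor:
print(mcv_min_count(30000))  # -> 30
# The same floor applied to the original 3000-row sample size:
print(mcv_min_count(3000))   # -> 3
```

Note the `max(2, ...)` keeps the existing two-appearance minimum for very small samples, where the frequency floor alone would require fewer than two occurrences.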
I wonder if we shouldn't make it be "at least double the estimated average frequency".

Or possibly calculate the sample standard deviation and make use of that to help decide on a more flexible cutoff than twice the avg frequency?

Are there any research papers that might help us here (I'm drowning in a sea of barely relevant search results for most phrases I've tried so far)? I recall there were some that Tom referenced when this stuff was originally written.

On the other hand I do have access to some mathematicians specializing in statistics - so can get their thoughts on this issue if you feel it would be worthwhile.

Cheers

Mark

The standard deviation (sd) of an estimate shrinks in proportion to the square root of the number in the sample in a Normal Distribution. In a Normal Distribution, about 2/3 of the values will be within plus or minus one sd of the mean.

There seems to be an implicit assumption that the distribution of values follows the Normal Distribution - has this been verified? I suspect that real data will have a skewed distribution of values, and may even be multi-modal (multiple peaks). The Normal Distribution has one central peak with 2 tails of the same shape & size.

So in a sample of 100 the sd is proportional to 10%, and for 10,000 the sd is proportional to 1%.

So essentially, the larger the sample size, the more reliable is knowledge of the most common values (ignoring pathologically extreme distributions!) - the measure of reliability depends on the distribution.

How about selecting the cut-off as the mean plus one sd, or something of that nature? Note that the cut-off point may result in no MCVs being selected - especially for small samples.

If practicable, it would be good to sample real datasets. Suggest looking at datasets where the current mechanism looks reasonable, and ones where the estimates are too far off.
Also, if possible, try any new selection method on the datasets and see what the difference is.

The above is based on what I remember from my university statistics papers; I took it up to 4th year level many moons ago.
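The mean-plus-one-sd cut-off suggested above could be sketched like this (a hypothetical helper over the per-value sample counts, not anything in PostgreSQL today):

```python
import statistics

def sd_cutoff(counts, num_sds=1.0):
    # Hypothetical cut-off: keep a value as an MCV only if its sample
    # count exceeds the mean count by num_sds standard deviations.
    # The cut-off may exceed every count, yielding no MCVs at all.
    return statistics.mean(counts) + num_sds * statistics.pstdev(counts)

counts = [50, 40, 3, 2, 2, 1, 1, 1]   # occurrences of each distinct value
cutoff = sd_cutoff(counts)            # 12.5 + ~18.9 = ~31.4
mcvs = [c for c in counts if c > cutoff]
print(mcvs)  # -> [50, 40]
```

As noted, on a small or flat sample the cut-off can sit above every count, in which case the MCV list comes out empty.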

With respect to the assumption of Normal distribution - it is easy to come up with an example that shows it need not hold: consider our pgbench 'accounts' table. The distribution of 'bid' values is not Normal (it is Uniform).

Now I realize I made people think about Normal distributions by mentioning computing the standard deviation of the sample (and I probably should have said 'standard deviation of the *sample frequencies*') - but this was only a suggestion for working out how to (maybe) be less arbitrary about how we decide what values to put in the MCV list (currently 25% different from the mean frequency).
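For comparison, the current 25%-above-average rule can be sketched as follows (an illustrative Python rendering of the rule under discussion; the function name and numbers are mine, not analyze.c):

```python
def passes_margin_rule(count, sample_rows, ndistinct, margin=1.25):
    # Keep a candidate MCV if its sample count exceeds the average
    # per-distinct-value count by the given margin (currently 25%).
    # Illustrative sketch of the rule under discussion, not analyze.c.
    avg_count = sample_rows / ndistinct
    return count > margin * avg_count

# Tom's point: with more distinct values than sample rows the threshold
# falls below one occurrence, so any value seen twice slips through:
print(passes_margin_rule(2, sample_rows=30000, ndistinct=100000))  # -> True
# With fewer distinct values the margin starts to bite: average count 30,
# so 40 occurrences clears a 25% margin (37.5) but not a doubled one (60):
print(passes_margin_rule(40, 30000, 1000))               # -> True
print(passes_margin_rule(40, 30000, 1000, margin=2.0))   # -> False
```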

regards

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers