On 06/06/17 10:12, Gavin Flower wrote:
Suddenly realized, that the distribution of the FREQUENCIES of values in
a Normal Distribution are probably NOT normally distributed!
On 06/06/17 09:41, Mark Kirkwood wrote:
The standard deviation (sd) is proportional to the square root of the
number in the sample in a Normal Distribution.
On 05/06/17 09:30, Tom Lane wrote:
I've been thinking about the behavior discussed in
and it seems to me that there are a couple of things we ought to do
First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value". The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that
originated when we were envisioning stats samples of just a few
rows --- specifically, default_statistics_target was originally just
leading to a 3000-row sample size. So accepting two-appearance
MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
could be a tenth or a hundredth of that.
As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000. With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV. That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the
value's frequency is. A rule of that general form seems like a good
but I now think the 25% threshold is far too small to do anything
In particular, in any case like this where there are more distinct
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this
totally useless to filter anything that we would otherwise consider
MCV. I wonder if we shouldn't make it be "at least double the
Or possibly calculate the sample standard deviation and make use of
that to help decide on a more flexible cutoff than twice the avg
Are there any research papers that might help us here (I'm drowning
in a sea of barely relevant search results for most phrases I've
tried so far)? I recall there were some that Tom referenced when this
stuff was originally written.
On the other hand I do have access to some mathematicians
specializing in statistics - so can get their thoughts on this issue
if you feel it would be worthwhile.
In a Normal Distribution, about 2/3 the values will be within plus or
minus one sd of the mean.
There seems to be an implicit assumption that the distribution of
values follows the Normal Distribution - has this been verified? I
suspect that real data will have a skewed distribution of values, and
may even be multi modal (multiple peaks) The Normal Distribution has
one central peak with 2 tails of the same shape & size.
So in a sample of 100 the sd is proportional to 10%,
for 10,000 the sd is proportional to 1%.
So essentially, the larger the sample size the more reliable is
knowledge of the most common values (ignoring pathologically extreme
distributions!) - the measure of reliability depends on the distribution.
How about selecting the cut off as the mean plus one sd, or something
of that nature? Note that the cut off point may result in no mcv's
being selected - especially for small samples.
If practicable, it would be good to sample real datasets. Suggest
looking at datasets were the current mechanism looks reasonable, and
ones were the estimates are too far off. Also, if possible, try any
new selection method on the datasets and see what the difference is.
The above is based on what I remember from my university statistics
papers, I took it up to 4th year level many moons ago.
For some shapes, and fancy names for them (like 'leptokurtic'), see:
For multi modal examples:
I tried looking for useful stuff to clarify things without success - so
I think that asking a practising statistician is probably a very good
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: