On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Yes, I think a percentage of the table is going to break down either
> at the high end or the low end.  Hand-waving (but based on
> experience), for a 1000 row table a statistics target of 10 is
> probably approximately right and 100 is too much and 1 is too little.
> But for a 1,000,000 row table 10,000 is probably too much and even
> 1,000 is pushing it.  So using a constant percentage of table rows
> doesn't feel right.  I had a thought today that it might make sense to
> use an exponential curve, like min(2 * N^(1/3), 10).  I can't really
> justify that mathematically, but that doesn't mean it won't work well
> in practice.

Well we can analyze it but as you said later, it all depends on what
queries you're running. If we want to aim for the same confidence
interval at all times, ie that the estimated frequency is accurate to
within +/- x% 95% of the time then:

If we're querying ranges a..b which represent a constant percentage of
the table we need a fixed number of buckets and a sample size that
varies very little with respect to the size of the table (effectively
constant).

If we're querying ranges a..b which are constant sized and therefore
represent a smaller percentage of the table as it grows then we need a
number of buckets that's proportional to the size of the table. The
sample size is proportional to the number of buckets (ie, it's a
constant sized sample per bucket).

If we're querying for a specific value which isn't one of the most
common values then I'm not clear how to characterize the accuracy or
precision of our current estimates let alone how they would vary if we
changed our sample sizes.

If we need to estimate ndistinct then we clearly need a sample of the
table the size of which is proportional to the size of the table. And
in practice to get accurate results it has to be a fairly high
percentage -- effectively meaning we should read the whole table.
>> I think increasing the MCV is too simplistic since we don't really
>> have any basis for any particular value. I think what we need are some
>> statistics nerds to come along and say here's this nice tool from
>> which you can make the following predictions and understand how
>> increasing or decreasing the data set size affects the accuracy of the
>> predictions.
>
> I'm not sure that's realistic, because everything depends on what
> queries you're running, and you can get arbitrary answers by
> postulating arbitrary queries.  However, this does not make me excited
> about "doing nothing".

Well our planner only needs to answer specific questions. We just
needs stats capable of answering "how many occurrences of x are there"
and "how many values are in the range x..y" for the normal estimation
functions. We have the latter but if there's a stat we're missing for
calculating the former more more robustly that would be great. We also
need ndistinct but that's another story.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to