On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(
Using the statistics target is a good idea, I think - we could use it to
determine the error rate of the filter. Something like

   error rate = 10 - (statistics_target - 100) / 100

which gives

   1% for statistics target 1000
   10% for statistics target 100

or maybe something like this, where the error rate grows faster for smaller
statistics target values:

   error rate = 1 + 91000 / (statistics_target^2)

which gives about

   1% for statistics target 1000
   10% for statistics target 100
   36% for statistics target 50

But I guess a 10% error rate is about the worst we can accept, so it does
not make much sense to use lower statistics target values.

>> Another possibility is to collect the data from just a small portion
>> of a table and then use the result to estimate the number of distinct
>> values for the whole table. But I'm not sure we can do this reliably,
>> I see many traps in this.
>
> This is how it works currently. The problem with this approach is that
> it gives you very few guarantees about how precise the result will be.
> Extrapolating works very well for things like MCVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with a high probability in the relatively few rows you sample.
> For the number of distinct values, however, this isn't true - only if
> ndistinct is an order of magnitude smaller than the number of rows can
> relatively few rows account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinct
> estimates? Just a wild idea, mind you, I haven't checked at all if that'd
> be even remotely feasible.

I was thinking about that too, and I think we could do this using the
pageinspect contrib module. Sure, there might be a problem with bloated
indexes. And relying on this actually means a multi-column index on all the
columns is required - individual indexes are not enough, as we need the
number of distinct combinations too.

regards
Tomas
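
A minimal sketch of what such an error rate would mean for the filter size,
assuming the linear statistics_target mapping above and the standard Bloom
filter sizing formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2; the function
names and the one-million-distinct-values example are arbitrary illustrations,
not PostgreSQL code:

    import math

    def error_rate(statistics_target):
        """Error rate (percent) derived from the statistics target, using
        the linear mapping sketched above: 10% at the default target of 100,
        1% at target 1000. This mapping is an assumption, not existing code."""
        return 10.0 - (statistics_target - 100) / 100.0

    def bloom_filter_size(expected_ndistinct, statistics_target):
        """Size a Bloom filter for the expected number of distinct values and
        the derived false-positive rate, using the textbook formulas
        m = -n*ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions."""
        p = error_rate(statistics_target) / 100.0
        n = expected_ndistinct
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    # One million expected distinct values:
    print(bloom_filter_size(1_000_000, 100))    # ~4.8M bits (~0.6 MB), 3 hash functions
    print(bloom_filter_size(1_000_000, 1000))   # ~9.6M bits (~1.2 MB), 7 hash functions

Under these assumptions, going from a 10% to a 1% error rate roughly doubles
the filter size, which is what makes tying the error rate to the statistics
target an attractive knob.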
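
A minimal simulation of the sampling problem discussed above, assuming a
uniformly distributed column, a one-million-row table and a 1% row sample
(all arbitrary choices): it shows how little of the distinct-value set a
small sample covers unless ndistinct is much smaller than the row count.

    import random

    def fraction_of_distinct_seen(nrows, ndistinct, sample_fraction, seed=0):
        """Build a column with `ndistinct` uniformly distributed values, take
        a random sample of the rows, and report what share of the values that
        actually occur in the column also shows up in the sample."""
        rng = random.Random(seed)
        column = [rng.randrange(ndistinct) for _ in range(nrows)]
        sample = rng.sample(column, int(nrows * sample_fraction))
        return len(set(sample)) / len(set(column))

    nrows = 1_000_000
    for ndistinct in (1_000, 100_000, 1_000_000):
        seen = fraction_of_distinct_seen(nrows, ndistinct, 0.01)
        print(f"ndistinct={ndistinct:>9}: 1% sample sees {seen:.0%} of the distinct values")

With a uniform distribution the sample covers nearly all values only in the
low-ndistinct case and undercounts badly once ndistinct approaches the sample
size, which matches the point above that a reliable ndistinct estimate needs
a large portion of the table.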