On Tue, Nov 25, 2008 at 06:59:25PM -0800, Dann Corbit wrote:
> I do have a statistics idea/suggestion (possibly useful with some future
> PostgreSQL 9.x or something):
> It is a simple matter to calculate lots of interesting univariate summary
> statistics with a single pass over the data (perhaps during a vacuum
> full).
> For instance with numerical columns, you can calculate mean, min, max,
> standard deviation, skew, kurtosis and things like that with a single
> pass over the data.

Calculating "interesting univariate summary statistics" and having
something useful to do with them are two different things entirely. Note
also that whereas this is simple for numeric columns, it's a very
different story for non-numeric data types that don't come from a
metric space. That said, the idea of a probability metric space is well
explored in the literature, and may have valuable application. The
current histogram implementation is effectively a description of the
probability metric space the column data live in.
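
For the numeric case, the single-pass calculation itself is easy enough to
sketch. The following is a minimal illustration in Python using Welford's
running update for the mean and variance; the function name and the
returned dictionary are hypothetical, and nothing here reflects how
PostgreSQL actually gathers statistics.

```python
import math

def single_pass_summary(values):
    """Hypothetical helper: summary statistics in one pass over the data."""
    n = 0
    mean = 0.0
    m2 = 0.0            # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # Welford's update
        lo = min(lo, x)
        hi = max(hi, x)
    if n == 0:
        raise ValueError("no values")
    # Population standard deviation (divide by n, not n - 1).
    return {"n": n, "mean": mean, "min": lo, "max": hi,
            "stddev": math.sqrt(m2 / n)}
```

The same loop could be extended to skew and kurtosis by tracking the
third and fourth central moments, but none of this has any meaning for
types that lack a notion of distance.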

> Now, if you store a few numbers calculated in this way, it can be used
> to augment your histogram data when you want to estimate the volume of a
> request. So (for instance) if someone asks for a scalar that is ">
> value" you can look to see what percentage of the tail will hang out in
> that neck of the woods using standard deviation and the mean.

Only if you know that the data follow a distribution that can be
described accurately with a standard deviation and a mean. If your data
don't follow a Gaussian distribution, this will give you bad estimates.
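
To make that concrete, here is a rough sketch in Python that compares a
mean-and-standard-deviation tail estimate against the actual fraction of
rows for a skewed column. The data are synthetic (evenly spaced quantiles
of an exponential distribution), and the function names are made up for
the illustration; this is an assumption-laden toy, not planner code.

```python
import math
import statistics

def gaussian_tail_estimate(values, threshold):
    """Estimate the fraction of values > threshold, assuming a Gaussian."""
    mean = statistics.fmean(values)
    stddev = statistics.pstdev(values)
    z = (threshold - mean) / stddev
    return 0.5 * math.erfc(z / math.sqrt(2))

def actual_tail_fraction(values, threshold):
    """Fraction of values that really are > threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Synthetic skewed column: quantiles of an exponential distribution
# with rate 1, so both the mean and the standard deviation are near 1.
n = 10000
column = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

estimated = gaussian_tail_estimate(column, 3.0)
actual = actual_tail_fraction(column, 3.0)
print(f"Gaussian estimate: {estimated:.4f}, actual: {actual:.4f}")
```

On this column the Gaussian assumption puts roughly half as many rows
above the threshold as are really there, and the error grows the further
out in the tail you look.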

- Josh / eggyknap
