Re: [HACKERS] Cross-column statistics revisited

Joshua Tolley Fri, 17 Oct 2008 17:48:15 -0700

On Fri, Oct 17, 2008 at 3:47 PM, Nathan Boley <[EMAIL PROTECTED]> wrote:
>>>> Right now our
>>>> "histogram" values are really quantiles; the statistics_target T for a
>>>> column determines a number of quantiles we'll keep track of, and we
>>>> grab values into an ordered list L so that approximately 1/T of
>>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>>> thinking that multicolumn statistics would instead divide the range of
>>>> each column up into T equally sized segments,
>>>
>>> Why would you not use the same histogram bin bounds derived for the
>>> scalar stats (along each axis of the matrix, of course)?  This seems to
>>> me to be arbitrarily replacing something proven to work with something
>>> not proven.  Also, the above forces you to invent a concept of "equally
>>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>>
>> Because I'm trying to picture geometrically how this might work for
>> the two-column case, and hoping to extend that to more dimensions, and
>> am finding that picturing a quantile-based system like the one we have
>> now in multiple dimensions is difficult. I believe those are the same
>> difficulties Gregory Stark mentioned having in his first post in this
>> thread. But of course that's an excellent point, that what we do now
>> is proven. I'm not sure which problem will be harder to solve -- the
>> weird geometry or the "equally sized ranges" for data types where that
>> makes no sense.
>>
>
> Look at copulas. They are a completely general method of describing
> the dependence between two marginal distributions. It seems silly to
> rewrite the stats table in terms of joint distributions when we'll
> still need the marginals anyway. Also, it might be easier to think of
> the dimension reduction problem in that form.
>
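
For reference, the two binning schemes discussed in the quoted text can
be sketched as follows. This is only an illustration, not what ANALYZE
actually does: it ignores sampling and most-common-value handling, and
the function names quantile_bounds and equal_width_bounds are made up
for the example.

```python
def quantile_bounds(values, target):
    """Return target + 1 bounds L such that roughly 1/target of the
    values fall between L[n] and L[n+1] (quantile-style bins)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Pick the value at each evenly spaced rank. Only an ordering on
    # the datatype is required, never arithmetic on the values.
    return [ordered[(i * last) // target] for i in range(target + 1)]


def equal_width_bounds(values, target):
    """Return target + 1 bounds that split [min, max] into segments of
    equal width. This needs subtraction and scaling on the datatype."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / target
    return [lo + i * step for i in range(target + 1)]


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(quantile_bounds(skewed, 4))     # [1, 3, 5, 7, 100]
print(equal_width_bounds(skewed, 4))  # [1.0, 25.75, 50.5, 75.25, 100.0]
```

With the skewed input, three of the four equal-width segments hold no
rows at all, while the quantile bounds still split the rows evenly.
Calling equal_width_bounds on a list of strings raises TypeError, which
is the "equally sized ranges" objection in miniature.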


I'm still working my way through the math, but copulas sound better
than anything else I've been playing with. What's more, there are
plenty of existing implementations to refer to, provided we do so in
a license-compatible way. A multidimensional extension of our existing
quantile histograms, at least in the ways I've been thinking of it,
quickly becomes a recursive problem -- perhaps some dynamic programming
approach would solve it, but copulas seem to be the more common
solution for similar problems in other fields. Thanks.
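
A rough sketch of the copula idea, as I currently understand it: keep
the per-column (marginal) statistics as they are, map each column to
its rank position in (0, 1), and store only how the ranks of the two
columns co-occur. The grid representation and the names to_uniform and
copula_grid below are my own assumptions for illustration; nothing here
reflects an existing PostgreSQL structure.

```python
from bisect import bisect_left, bisect_right


def to_uniform(column):
    """Map each value to its mid-rank position in (0, 1).

    This is the empirical CDF of the column, so only an ordering on
    the datatype is needed. Ties share the same position."""
    ordered = sorted(column)
    n = len(ordered)
    return [
        (bisect_left(ordered, v) + bisect_right(ordered, v)) / (2 * n)
        for v in column
    ]


def copula_grid(xs, ys, bins):
    """Fraction of rows in each cell of a bins x bins grid laid over
    rank space (a coarse empirical copula density)."""
    us, vs = to_uniform(xs), to_uniform(ys)
    counts = [[0] * bins for _ in range(bins)]
    for u, v in zip(us, vs):
        # u and v are strictly below 1, so the index stays in range.
        counts[int(u * bins)][int(v * bins)] += 1
    n = len(xs)
    return [[c / n for c in row] for row in counts]


# Perfectly correlated columns: all mass sits on the diagonal.
print(copula_grid([1, 2, 3, 4], [10, 20, 30, 40], 2))
# [[0.5, 0.0], [0.0, 0.5]]
```

In that example the fraction of rows with both columns in their lower
half is 0.5, where multiplying the two marginal selectivities (the
independence assumption) would give 0.25. Because each axis of the grid
is in rank space, its cell boundaries line up with the quantile bounds
of the corresponding column, so the scalar stats stay useful as is.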

- Josh / eggyknap

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
