On Dec 18, 2010, at 17:59, Tomas Vondra wrote:
> It seems to me you're missing one very important thing - this was not
> meant as a new default way to do estimates. It was meant as an option
> when the user (DBA, developer, ...) realizes the current solution gives
> really bad estimates (due to correlation). In that case he could create
> 'cross-column' statistics on those columns, and the optimizer would use
> that info to do the estimates.

I do understand that. I just have the nagging feeling that there is a
way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
to apply the uniform bayesian approach or to assume the columns are
unrelated.

I played with this for a bit over the weekend, but unfortunately ran out
of time. So I'm writing up what I found, to prevent it from getting lost.

I tried to pick up Robert's idea of quantifying "Implicativeness" -
i.e., finding a number between 0 and 1 that describes how close the
pairs (A,B) come to representing a function A -> B.

Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
estimates of dist(?) are consistent. From that you easily get

  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

If dist(A) == dist(A,B), then there is a functional dependency
A -> B, and conversely if dist(B) == dist(A,B) there is a functional
dependency B -> A. Note that you can have both at the same time!
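In code, that criterion is just a comparison of distinct counts. A
minimal sketch (toy data and function names are mine, assuming exact
counts rather than estimates):

```python
def distinct_counts(rows):
    """rows: iterable of (a, b) pairs; returns (dist(A), dist(B), dist(A,B))."""
    a_vals = {a for a, _ in rows}
    b_vals = {b for _, b in rows}
    ab_vals = set(rows)
    return len(a_vals), len(b_vals), len(ab_vals)

def functional_dependencies(rows):
    # A -> B iff dist(A) == dist(A,B); B -> A iff dist(B) == dist(A,B)
    d_a, d_b, d_ab = distinct_counts(rows)
    return {"A->B": d_a == d_ab, "B->A": d_b == d_ab}

# zip code -> city is a function; city -> zip code is not.
rows = [(8010, "Graz"), (8020, "Graz"), (1010, "Wien"), (1010, "Wien")]
```

Here dist(A) = 3 = dist(A,B), so A -> B holds, while dist(B) = 2 < 3,
so B -> A does not.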

On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
smallest number of distinct values possible for a given combination
of dist(A,B) and dist(A). This is the anti-function case.

This motivates the definition

  F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]

(You can probably drop the "-1"; it doesn't make much of a difference
for larger values of dist(B).)

F(A,B) specifies where dist(A) lies between dist(A,B)/dist(B) and
dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B), while
a value of 1 indicates dist(A) = dist(A,B).

So F(A,B) is a suitable measure of "Implicativeness" - it's higher
if the table (A,B) looks more like a function A -> B.
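A direct transcription of the formula, to make the two boundary cases
concrete (the example numbers below are made up):

```python
def implicativeness(d_a, d_b, d_ab):
    """F(A,B) = [dist(A)*dist(B) - dist(A,B)] / [dist(A,B) * (dist(B) - 1)].

    Returns 0 in the anti-function case, dist(A) == dist(A,B)/dist(B),
    and 1 in the function case, dist(A) == dist(A,B).
    Assumes dist(B) > 1 (the "-1" makes dist(B) == 1 degenerate).
    """
    return (d_a * d_b - d_ab) / (d_ab * (d_b - 1))

# Function case: dist(A) == dist(A,B) == 100, dist(B) == 10  ->  F == 1
# Anti-function: dist(A) == 100/10 == 10                     ->  F == 0
```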

You might use that to decide if either A->B or B->a looks function-like
enough to use the uniform bayesian approach. Or you might even go further,
and decide *with* bayesian formula to use - the paper you cited always
averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

but they offer no convincing reason for that other than "We don't know
which to pick".
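One way that choice could look, as a sketch: compute F in both
directions and fall back to averaging only when neither direction is
function-like. The 0.5 threshold is my assumption, not something from
the paper.

```python
def choose_direction(d_a, d_b, d_ab, threshold=0.5):
    """Pick which conditional-probability decomposition to trust,
    based on F(A,B) and F(B,A). threshold is an assumed cutoff."""
    f_ab = (d_a * d_b - d_ab) / (d_ab * (d_b - 1))  # F(A,B): A -> B
    f_ba = (d_a * d_b - d_ab) / (d_ab * (d_a - 1))  # F(B,A): B -> A
    if max(f_ab, f_ba) < threshold:
        return "average"   # neither direction looks function-like
    return "A->B" if f_ab >= f_ba else "B->A"
```

E.g. with dist(A) = 100, dist(B) = 10, dist(A,B) = 100 this picks
"A->B"; with dist(A) = dist(B) = 10 and dist(A,B) = 100 it falls back
to averaging.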

I'd like to find a statistical explanation for that definition of
F(A,B), but so far I haven't been able to come up with one. I created a
Maple 14 worksheet while playing around with this - if you happen to have
a copy of Maple available I'd be happy to send it to you.

This is what I've got so far - I hope it may prove to be of use somehow.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers