# Re: [HACKERS] proposal : cross-column stats

```> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <f...@phlo.org> wrote:
>> You might use that to decide if either A->B or B->a looks function-like
>> enough to use the uniform bayesian approach. Or you might even go
>> further,
>> and decide *with* bayesian formula to use - the paper you cited always
>> averages
>>
>>  P(A=x|B=y)*P(B=y) and
>>  P(B=y|A=x)*P(A=x)
>>
>> but they offer no convincing reason for that other than "We don't know
>> which to pick".
>
> Ideally you want to somehow make this a continuous transaition between
> the available formulas rather than a discrete transition, I think.  If
> F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
> if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
> what?  A naive approach would be to estimate P(A=x && B=y) = P(A=x) *
> (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
> P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
> we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
> = 0.055.  Of course I'm just hand-waving here, and this is without any
> mathematical basis, being just the simplest formula I could think of
> that gets the endpoints right and plots some sort of smooth curve
> between them in the middle.  A similar formula with a believable
> argument to back it up seems like it would be a big step forward for
> this method.```
```
This somehow reminds me how the various t-norms in fuzzy logic evolved.
I'm not saying we should use fuzzy logic here, but the requirements are
very similar so it might be an interesting inspiration. See for example
this http://plato.stanford.edu/entries/logic-fuzzy (chapter 4).

And there's one additional - IMHO very important - requirement. The whole
thing should easily extend to more than two columns. This "IF (F(A,B) >
F(B,A)) THEN ..." probably is not a good solution regarding this.

For example given 3 columns A,B,C, would you do that comparison for each
pair of columns, or would you do that for A vs (B,C)? Or maybe a
completely different approach? Because that would require to collect a lot
more data (number of distinct values in each combination) etc.

I'm not saying for example there is a table with (C=A+B)

A | B | C
===========
1 | 1 | 2
1 | 2 | 3
1 | 3 | 4
2 | 1 | 3
2 | 2 | 4
2 | 3 | 5
3 | 1 | 4
3 | 2 | 5
3 | 3 | 6

So that dist(A)=dist(B)=3, dist(C)=6 and dist(A,B,C)=dist(A,B)=9. Given
the paper, you get something like

P(A,B,C) = [dist(A)*P(A) +  dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)]
= [P(A) + P(B) + 2*P(C)] / 9

so for example

P(A=3,B=2,C=5) = [1/3 + 1/3 + 2/9]/9 = (8/9)/9

which is almost correct (by 1/81).

Don't get me wrong - I'm not a fanatic who thinks this particular formula
is the best formula possible. I'm just saying we could end up with a
formula that works beautifully in 2D, but once we get to 3 columns it
fails miserably.

Hmmm, maybe we could give this possibility (to identify two separate
groups of columns) to the user. So instead of 'buid stats for (A,B,C)' the
user would say 'build stats for (A,B) and (C)' - this actually represents
apriori knowledge of dependencies supplied by the user.

In that case we could search for 'implicativeness' between those two
groups (and not within the groups), and we could restrict ourselves to 2D
(and thus use a more sophisticated formula).

But we should be able to do some basic analysis even when the user
supplies a list of columns without such apriori knowledge.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
```