> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <f...@phlo.org> wrote: >> You might use that to decide if either A->B or B->a looks function-like >> enough to use the uniform bayesian approach. Or you might even go >> further, >> and decide *with* bayesian formula to use - the paper you cited always >> averages >> >> P(A=x|B=y)*P(B=y) and >> P(B=y|A=x)*P(A=x) >> >> but they offer no convincing reason for that other than "We don't know >> which to pick". > > Ideally you want to somehow make this a continuous transaition between > the available formulas rather than a discrete transition, I think. If > F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and > if it's 0, then it's P(A=x)*P(B=y). But suppose F(A,B)=0.5. Then > what? A naive approach would be to estimate P(A=x && B=y) = P(A=x) * > (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and > P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1 > we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9) > = 0.055. Of course I'm just hand-waving here, and this is without any > mathematical basis, being just the simplest formula I could think of > that gets the endpoints right and plots some sort of smooth curve > between them in the middle. A similar formula with a believable > argument to back it up seems like it would be a big step forward for > this method.
This somehow reminds me how the various t-norms in fuzzy logic evolved. I'm not saying we should use fuzzy logic here, but the requirements are very similar so it might be an interesting inspiration. See for example this http://plato.stanford.edu/entries/logic-fuzzy (chapter 4). And there's one additional - IMHO very important - requirement. The whole thing should easily extend to more than two columns. This "IF (F(A,B) > F(B,A)) THEN ..." probably is not a good solution regarding this. For example given 3 columns A,B,C, would you do that comparison for each pair of columns, or would you do that for A vs (B,C)? Or maybe a completely different approach? Because that would require to collect a lot more data (number of distinct values in each combination) etc. I'm not saying for example there is a table with (C=A+B) A | B | C =========== 1 | 1 | 2 1 | 2 | 3 1 | 3 | 4 2 | 1 | 3 2 | 2 | 4 2 | 3 | 5 3 | 1 | 4 3 | 2 | 5 3 | 3 | 6 So that dist(A)=dist(B)=3, dist(C)=6 and dist(A,B,C)=dist(A,B)=9. Given the paper, you get something like P(A,B,C) = [dist(A)*P(A) + dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)] = [P(A) + P(B) + 2*P(C)] / 9 so for example P(A=3,B=2,C=5) = [1/3 + 1/3 + 2/9]/9 = (8/9)/9 which is almost correct (by 1/81). Don't get me wrong - I'm not a fanatic who thinks this particular formula is the best formula possible. I'm just saying we could end up with a formula that works beautifully in 2D, but once we get to 3 columns it fails miserably. Hmmm, maybe we could give this possibility (to identify two separate groups of columns) to the user. So instead of 'buid stats for (A,B,C)' the user would say 'build stats for (A,B) and (C)' - this actually represents apriori knowledge of dependencies supplied by the user. In that case we could search for 'implicativeness' between those two groups (and not within the groups), and we could restrict ourselves to 2D (and thus use a more sophisticated formula). But we should be able to do some basic analysis even when the user supplies a list of columns without such apriori knowledge. regards Tomas -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers