On Dec 21, 2010, at 13:25, t...@fuzzy.cz wrote:
> And there's one additional - IMHO very important - requirement. The whole
> thing should easily extend to more than two columns. This "IF (F(A,B) >
> F(B,A)) THEN ..." probably is not a good solution regarding this.
> 
> For example given 3 columns A,B,C, would you do that comparison for each
> pair of columns, or would you do that for A vs (B,C)? Or maybe a
> completely different approach? Because that would require collecting a lot
> more data (number of distinct values in each combination) etc.

That's certainly a valid concern. The uniform bayesian approach avoids that
quite beautifully...

> Hmmm, maybe we could give this possibility (to identify two separate
> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
> user would say 'build stats for (A,B) and (C)' - this actually represents
> apriori knowledge of dependencies supplied by the user.
> 
> In that case we could search for 'implicativeness' between those two
> groups (and not within the groups), and we could restrict ourselves to 2D
> (and thus use a more sophisticated formula).

Hm, I hated this idea at first, but I'm starting to like it more and more.
It *does* seem rather unrealistic that a user would know that a bunch of
columns are correlated, but have no idea in what way... 

Any examples where this would be the case would be very much appreciated - maybe
we should ask around on -general about this?

> But we should be able to do some basic analysis even when the user
> supplies a list of columns without such apriori knowledge.

That, I think, overcomplicates things, at least for a first cut.

To summarize, I think you've shown quite nicely that the uniform bayesian
approach is a very sensible first step towards better estimates in the case
of correlated columns. It's statistically sound, and the dist(A,B) estimates
it requires are probably a necessary ingredient of any solution to the problem.
If we can make it degrade more gracefully when the columns are uncorrelated, we
should do that, but if we can't, that's still no reason to drop the whole idea.
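
(Rough sketch of why dist(A,B) is the key ingredient, at least as I read the
uniform assumption: among the rows with A=a, B takes about dist(A,B)/dist(A)
distinct values, each roughly equally often. That gives
P(B=b | A=a) ~= dist(A)/dist(A,B), and hence

  P(A=a AND B=b) ~= P(A=a) * dist(A) / dist(A,B)

and symmetrically with A and B swapped. So apart from the per-column
statistics we already have, dist(A,B) is the one additional number the
estimate needs.)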

So I guess we should turn our attention to how we'd obtain reasonably good
estimates of dist(A,B), and return to the current discussion once the other
pieces are in place.

I think it might be acceptable to scan a large portion of the table to
estimate dist(A,B), *if* we only have to do so once in a while. But even with
a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all the
values seen so far in memory, and spilling them into, say, an on-disk hash
table, adds even more overhead to the already expensive full scan. Maybe using
a bloom filter instead of a hash table could avoid the spilling to disk, in
exchange for a slightly less precise result...
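
To make that last idea a bit more concrete, here's a rough, self-contained
sketch (plain standalone C, not backend code - the filter size, hash function
and test data are all just illustrative assumptions) of how a bloom filter
could count distinct (A,B) pairs during a scan. A pair is counted as new only
if the filter didn't already contain all of its bits, so false positives make
us undercount a little - that's the "slightly less precise result" above.

/*
 * Sketch: approximate dist(A,B) with a fixed-size bloom filter.
 * Each (a,b) pair is hashed NHASH times; if any of the probed bits was
 * still unset, the pair is considered new and the counter is bumped.
 */
#include <stdint.h>
#include <stdio.h>

#define NBITS   (1u << 24)      /* 16M bits, i.e. a 2 MB filter */
#define NHASH   4               /* number of hash probes per pair */

static unsigned char filter[NBITS / 8];

/* 64-bit FNV-1a over the pair, with a per-probe seed */
static uint32_t
hash_pair(uint32_t a, uint32_t b, uint32_t seed)
{
    uint64_t        h = 14695981039346656037ULL ^ seed;
    uint32_t        v[2] = {a, b};
    unsigned char  *p = (unsigned char *) v;
    size_t          i;

    for (i = 0; i < sizeof(v); i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (uint32_t) (h % NBITS);
}

/*
 * Probe and set the filter bits for (a,b). Returns 1 if the pair was
 * (probably) seen before, 0 if it is definitely new.
 */
static int
bloom_test_and_set(uint32_t a, uint32_t b)
{
    int         seen = 1;
    uint32_t    s;

    for (s = 0; s < NHASH; s++)
    {
        uint32_t    bit = hash_pair(a, b, s);

        if (!(filter[bit / 8] & (1u << (bit % 8))))
        {
            seen = 0;
            filter[bit / 8] |= (unsigned char) (1u << (bit % 8));
        }
    }
    return seen;
}

int
main(void)
{
    uint64_t    dist_ab = 0;
    uint32_t    row;

    /* Fake "table scan": 1M rows, A has 1000 distinct values and B is
     * fully determined by A, so the true dist(A,B) is 1000. */
    for (row = 0; row < 1000000; row++)
    {
        uint32_t    a = row % 1000;
        uint32_t    b = a % 100;

        if (!bloom_test_and_set(a, b))
            dist_ab++;
    }
    printf("estimated dist(A,B) = %llu\n", (unsigned long long) dist_ab);
    return 0;
}

For real use we'd presumably size the filter from some rough prior guess of
dist(A,B), but the memory use stays fixed no matter how many rows we scan,
which is the whole point.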

best regards,
Florian Pflug

