On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: >> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >>> Well, not really - I haven't done any experiments with it. For two >>> columns selectivity equation is >>> >>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >>> >>> where A and B are columns, dist(X) is number of distinct values in >>> column X and sel(X) is selectivity of column X. >> >> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely, >> if A and B are independent, the formula must reduce to sel(A) * sel(B), >> and I cannot see how that'd work with the formula above. > > Yes, it's a selectivity estimate for P(A=a and B=b). It's based on > conditional probability, as > > P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a) > > and "uniform correlation" assumption so that it's possible to replace the > conditional probabilities with constants. And those constants are then > estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
Ok, I think I understand this now. The selectivity equation actually *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate for sel(A). Take the clause "A = x AND B = y" for example. Without knowing anything about x and y, reasonable guesses for sel(A=x) and sel(B=y) are sel(A=x) = 1 / dist(A) sel(B=y) = 1 / dist(B). This is also what we do currently, according to var_eq_non_const() in src/backend/utils/adt/selfuncs.c, if we don't have any additional knowledge about x (Actually, we also factor the probability of A being NULL into this). With these estimates, your formula becomes sel(A=x,B=y) = 1 / dist(A,B). and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus sel(A=x,B=y) = sel(A=x) * sel(B=y). If, however, y is a constant, then we use the MKVs to estimate sel(B=y) (var_eq_const() in src/backend/utils/adt/selfuncs.c). If sel(B=y) ~= 0, we'd currently also conclude that sel(A=x,B=y) ~= 0. With the "uniform correlation" approach, we'd instead estimate sel(A=x,B=y) ~= sel(A=x) / dist(B) assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated. If dist(B) is small, this estimate is much worse than what we'd currently get, since we've effectively ignored the information that the restriction B=y alone guarantees that only very few rows will match. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers