Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general,
>> what improvements are possible, and what issues there are. Anyway,
>> this is by no means a perfect or complete solution - it's just a
>> starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>   anything with a b-tree operator class. What you've coded seems to
>   rely on calculations on the values. Have you thought about how it
>   could work for, for example, strings?
Yes, I know - I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column), and that can be done whenever you can sort the values; you
don't need any arithmetic on them. So it should work with strings.

Another thing I somehow forgot is handling the case when there is no
histogram, just an MCV list. That's mostly the same - each of the
values might be a separate group, or the values might be merged to form
fewer groups, etc.

> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.
> Not that I suggest you fix this, but it's food for thought. Though
> strictly speaking this is a different kind of correlation than what
> you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between the expected and observed number of rows). The problem is that
once we get down to the cell level of the contingency table, there is
no additional (more detailed) information, so we're stuck with the
multiplication estimate, or something like that.

I was actually thinking about this, and I think we could collect some
more info - a correlation coefficient for each bin, or something like
that. But that was not part of my proposal, and I'm not sure how to do
that.

>> 2) I really don't think we should collect stats for all combinations
>>    of columns of a table - I do like the Oracle-like approach where a
>>    DBA has to enable cross-column stats using an ALTER TABLE (for a
>>    particular list of columns).
>>
>>    The only exception might be columns from a multi-column index. It
>>    might be quite efficient, I guess?
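BTW, to illustrate the sort-based grouping mentioned at the top of this
mail, here is a minimal sketch (not the patch itself - the function
names, bin counts and data are just made up for illustration). The bins
for each column are derived purely from comparisons, which is why the
same approach covers anything with a b-tree operator class, strings
included:

```python
# Hypothetical sketch: a contingency table for two columns, with bins
# derived purely from sort order (b-tree style comparisons) - no
# arithmetic on the values, so strings work just as well as integers.
from bisect import bisect_right

def bin_bounds(values, nbins):
    """Pick upper bin boundaries from the sorted distinct values."""
    distinct = sorted(set(values))
    step = max(1, len(distinct) // nbins)
    return distinct[step::step]

def contingency_table(rows, nbins=4):
    bounds_a = bin_bounds([a for a, _ in rows], nbins)
    bounds_b = bin_bounds([b for _, b in rows], nbins)
    table = [[0] * (len(bounds_b) + 1) for _ in range(len(bounds_a) + 1)]
    for a, b in rows:
        # bisect only compares values, so any sortable type works
        table[bisect_right(bounds_a, a)][bisect_right(bounds_b, b)] += 1
    return table
```

E.g. feeding it (city, ZIP) pairs buckets the rows into a small grid of
counts; every row lands in exactly one cell.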
> In the past it has been suggested to only do it for multi-column
> indexes, but these days I find that in some situations I prefer to
> make individual indexes and let the bitmap scan code combine them. So
> perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too. The idea behind collecting
cross-column stats for multi-column indexes was that maybe we could
'append' this to the current functionality (building the index or
something like that) so that it does not introduce significant
performance problems.

>> 3) There are independence tests for contingency tables (e.g.
>>    Pearson's chi-squared test), so it's easy to find out whether the
>>    columns are independent. In that case we can just throw away these
>>    stats and use the simple estimation.
>>
>>    http://mathworld.wolfram.com/Chi-SquaredTest.html
>
> I think this would be good to include, if possible.
>
> Actually, I wonder if the existing stats collection code could be
> altered to attempt to calculate the correlation between columns as
> part of its other work.

I guess that would be rather expensive - to compute a correlation you
need two passes, and you'd have to do that for each pair of columns. So
I'd be surprised if it were possible (and effective).

Another thing is that you can compute a correlation only for numeric
columns, so it's not possible to do that for the city/ZIP code case
mentioned above. More precisely, it is possible (if you map strings to
numbers somehow), but I doubt you'd get useful results, as the
assignment is rather arbitrary. Well, you could ask the governments to
assign ZIP codes to cities in strictly alphabetical order, but I guess
they'd say no.

>> 4) Or we could store just those cells where the expected and observed
>>    values differ significantly (may help if most of the values are
>>    independent, but there's a small glitch somewhere).
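For points 3) and 4) above, a minimal illustrative sketch (again, not
the actual patch code - thresholds and helper names are invented): the
expected cell counts under independence come from the row/column
totals, the chi-squared statistic compares them against the observed
counts, and the same expected counts can be reused to decide which
cells are worth storing:

```python
# Hypothetical sketch of points 3) and 4): Pearson's chi-squared
# independence test on a contingency table, plus keeping only the
# cells whose observed count deviates from the expected one.

def expected_counts(observed):
    """Expected cell counts if the two columns were independent."""
    total = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    return [[row_tot[i] * col_tot[j] / total
             for j in range(len(col_tot))]
            for i in range(len(row_tot))]

def chi_squared(observed):
    """Pearson's chi-squared statistic; small => columns look independent."""
    stat = 0.0
    for row_obs, row_exp in zip(observed, expected_counts(observed)):
        for obs, exp in zip(row_obs, row_exp):
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    # compare against the chi^2 critical value with
    # (rows - 1) * (cols - 1) degrees of freedom
    return stat

def significant_cells(observed, ratio=1.5):
    """Keep only cells where observed and expected counts differ by
    more than the given factor, in either direction - both kinds of
    misestimate matter for plan choice."""
    kept = {}
    for i, row_exp in enumerate(expected_counts(observed)):
        for j, exp in enumerate(row_exp):
            obs = observed[i][j]
            if exp == 0 or obs > ratio * exp or obs < exp / ratio:
                kept[(i, j)] = obs
    return kept
```

An independent table such as [[10, 10], [10, 10]] gives a statistic of
0.0 and no cells to store, while a strongly dependent one such as
[[10, 0], [0, 10]] gives 20.0 and keeps all four cells.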
> Compressing that grid would be useful, given that for many dimensions
> most of the grid will not be interesting. In fact, storing the 20
> largest values may be enough. Worth an experiment.

Not exactly the largest values - rather the values that differ
significantly from the expected ones. Generally there are two
interesting cases:

  expected << observed - The optimizer may choose an index scan,
                         although a seq scan would be better.

  expected >> observed - The optimizer may choose a seq scan, although
                         an index scan would be better.

regards
Tomas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers