Hi all,

Attached is a rebased version of this patch series, mostly just fixing the breakage caused by the reworked format of the initial catalog data.
Aside from that, the MCV building now adopts the logic introduced by commit b5db1d93d2 for single-column MCV lists. The new algorithm seems pretty good, and I don't see why multi-column MCV lists should use something special.

I'm sure there are plenty of open questions to discuss, particularly the stuff related to combining the various types of statistics into the final estimate (a lot of that was already improved based on Dean's reviews).

One thing that occurred to me while comparing the single-column logic (as implemented in selfuncs.c) with the new multi-column code is the handling of partially-matching histogram buckets.

In the single-column case, we pretty much assume a uniform distribution within each bucket and linearly interpolate the selectivity. So for a bucket with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x < 7" we return 0.7, and so on. This is what convert_to_scalar() does.

In the multi-column case, we simply count each partially-matching bucket as 0.5, without any attempt to linearly interpolate. It would not be difficult to call convert_to_scalar() for each condition (essentially repeating the linear interpolation for each column), but then what? We could simply compute a product of those per-column fractions, of course, but that only works assuming independence. And that's exactly the wrong thing to assume here, considering the extended statistics are meant for cases where the columns are not independent.

So I'd argue the 0.5 estimate for partially-matching buckets is the right thing to do here, as it minimizes the average error.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
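(Just to make the single-column interpolation concrete, here is a minimal sketch of the idea; this is not the actual selfuncs.c code, and it assumes the bucket boundaries and the constant have already been converted to plain doubles, which is what convert_to_scalar() takes care of.)

/*
 * Rough sketch of the per-bucket linear interpolation idea (not the
 * actual selfuncs.c code): assuming a uniform distribution within a
 * bucket [lo, hi], the fraction of the bucket satisfying "x <= value"
 * is simply (value - lo) / (hi - lo), clamped to [0, 1].
 */
static double
bucket_fraction(double lo, double hi, double value)
{
	if (value <= lo)
		return 0.0;
	if (value >= hi)
		return 1.0;
	return (value - lo) / (hi - lo);
}

/* e.g. bucket_fraction(0, 10, 5) = 0.5, bucket_fraction(0, 10, 7) = 0.7 */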
0001-multivariate-MCV-lists.patch.gz
Description: application/gzip
0002-multivariate-histograms.patch.gz
Description: application/gzip