On Wed, 17 Mar 2021 at 21:31, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
> I agree applying at least the [(a+b),c] stats is probably the right
> approach, as it means we're considering at least the available
> information about dependence between the columns.
>
> I think to improve this, we'll need to teach the code to use overlapping
> statistics, a bit like conditional probability. In this case we might do
> something like this:
>
>     ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Yes, I was thinking the same thing. That would be equivalent to applying
a multiplicative "correction" factor of

    ndistinct(a,b,c,...) / ( ndistinct(a) * ndistinct(b) * ndistinct(c) * ... )

for each multivariate stat applicable to more than one column/expression,
regardless of whether those columns were already covered by other
multivariate stats. That might well simplify the implementation, as well
as probably produce better estimates.

> But that's clearly a matter for a future patch, and I'm sure there are
> cases where this will produce worse estimates.

Agreed.

> Anyway, I plan to go over the patches one more time, and start pushing
> them sometime early next week. I don't want to leave it until the very
> last moment in the CF.

+1. I think they're in good enough shape for that process to start.

Regards,
Dean
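
For illustration only, the correction-factor idea described above could be sketched roughly as follows. This is not code from the patch: the function name estimate_groups, the dictionary layout, and all ndistinct values are made up for the example, and it assumes every expression has a per-expression ndistinct estimate available.

```python
from math import prod


def estimate_groups(single, multi):
    """Estimate the number of distinct groups (hypothetical helper).

    single: dict mapping each expression to its own ndistinct estimate.
    multi:  dict mapping a tuple of expressions to the ndistinct of that
            combination, one entry per applicable multivariate stat.
    """
    # Start from the independence assumption: multiply the
    # per-expression ndistinct values together.
    estimate = prod(single.values())

    # Apply one multiplicative correction factor per multivariate stat,
    # whether or not its expressions overlap with another stat.
    for exprs, nd in multi.items():
        estimate *= nd / prod(single[e] for e in exprs)

    return estimate


# Made-up numbers: two stats that overlap on "c".
single = {"(a+b)": 100, "c": 10, "(c+d)": 50}
multi = {("(a+b)", "c"): 200, ("c", "(c+d)"): 50}

est = estimate_groups(single, multi)
```

With two stats that overlap only on c, the product of the factors collapses to ndistinct((a+b),c) * ndistinct(c,(c+d)) / ndistinct(c), so the made-up numbers above give roughly 200 * 50 / 10 = 1000 groups instead of the 50000 that full independence would predict.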