On Wed, 17 Mar 2021 at 17:26, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
> My concern is that the current behavior (where we prefer expression
> stats over multi-column stats to some extent) works fine as long as the
> parts are independent, but once there's dependency it's probably more
> likely to produce underestimates. I think underestimates for grouping
> estimates were a risk in the past, so let's not make that worse.
>
I'm not sure the current behaviour really is preferring expression
stats over multi-column stats. In this example, where we're grouping
by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
those multi-column stats actually match more than one
column/expression. If anything, I'd go the other way and say that it
was wrong to use the [(a+b),c] stats in the first case, where they
were the only stats available, since those stats aren't really
applicable to (c+d), which probably ought to be treated as
independent. IOW, it might have been better to estimate the first case
as

  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

and the second case as

  ndistinct((a+b)) * ndistinct((c+d))

Regards,
Dean
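
To make the arithmetic in the two proposed estimates concrete, here is a minimal Python sketch that multiplies per-item ndistinct values under the independence assumption. The ndistinct numbers are invented purely for illustration, and `independent_estimate` is a hypothetical helper, not a function in the planner; the real estimation code also applies clamping and other adjustments that are left out here.

```python
from math import prod


def independent_estimate(ndistinct_values):
    """Estimate the number of groups by treating every item as independent,
    i.e. by multiplying the per-item ndistinct values together."""
    return prod(ndistinct_values)


# Hypothetical ndistinct values, made up for this example only.
nd = {"(a+b)": 50, "c": 20, "d": 10, "(c+d)": 30}

# First case: only [(a+b),c] stats exist, so (c+d) has no applicable stats
# and is broken down into its columns c and d, each treated as independent.
case1 = independent_estimate([nd["(a+b)"], nd["c"], nd["d"]])

# Second case: stats on (c+d) are also available, so that expression's
# ndistinct is used directly instead of the per-column values.
case2 = independent_estimate([nd["(a+b)"], nd["(c+d)"]])

print(case1, case2)
```

With these made-up inputs, the second estimate is smaller than the first only because the assumed ndistinct((c+d)) is smaller than ndistinct(c) * ndistinct(d); nothing in the sketch models dependency between (a+b) and c.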