On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > > I'm sorry, but I don't see how we could do this for arbitrary clauses. I > think we could do that for clauses that have equality semantics and > reference column values as a whole. So I think it's possible to do this > for IN clauses (which is what the first part of the patch does), but I > don't think we can do it for the containment operator. > > I.e. we can do that for > > WHERE a IN (...) AND b IN (...) >
Hmm, the difficulty always comes back to the compatibility of the clauses though. It's easy to come up with artificial examples for which functional dependencies come up with bad estimates, even with just = and IN (...) operators. For example, given a perfect correlation like a | b ------- 1 | 1 2 | 2 3 | 3 : | : you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b IN (2,4,6,8,...)" to get a very bad estimate from functional dependencies. However, I don't think such artificial examples are that useful. I think you have to think in terms of real data distributions together with real queries expected to go with them. For example: Using the OP's original example of a multi-tenant system, you might well have a table with columns (product_type, tenant_id) and a functional dependency product_type => tenant_id. In that case, it could well be very useful in optimising queries like "WHERE product_type IN (X,Y,Z) AND tenant_id = 123". But this isn't necessarily limited to = and IN (...). For example, consider a table with UK-based geographic data with columns (location point, postcode text). Then there would be a strong functional dependency location => postcode (and possibly also the other way round, depending on how dense the points were). That dependency could be used to estimate much more general queries like "WHERE location <@ some_area AND postcode ~ '^CB.*'", where there may be no useful stats on location, but a histogram on postcode might give a reasonable estimate. This also extends to inequalities. For example a table with columns (weight, category) might have a strong functional dependency weight => category. Then a query like "WHERE weight > 10 AND weight < 20 AND category = 'large'" could get a decent estimate from a histogram on the weight column, and then use the functional dependency to note that that implies the category. Note that such an example would work with my patch from the other thread, because it groups clauses by column, and uses clauselist_selectivity_simple() on them. So in this case, the clauses "weight > 10 AND weight < 20" would be estimated together, and would be able to make use of the RangeQueryClause code. Of course, it's equally easy to come up with counter-example queries for any of those examples, where using the functional dependency would produce a poor estimate. Ultimately, it's up to the user to decide whether or not to build functional dependency statistics, and that decision needs to be based not just on the data distribution, but also on the types of queries expected. Given the timing though, perhaps it is best to limit this to IN (..) clauses for PG13, and we can consider other possibilities later. Regards, Dean