Re: PATCH: add support for IN and @> in functional-dependency statistics use

Dean Rasheed Fri, 13 Mar 2020 01:43:51 -0700

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <[email protected]> wrote:
>
> I'm sorry, but I don't see how we could do this for arbitrary clauses. I
> think we could do that for clauses that have equality semantics and
> reference column values as a whole. So I think it's possible to do this
> for IN clauses (which is what the first part of the patch does), but I
> don't think we can do it for the containment operator.
>
> I.e. we can do that for
>
>      WHERE a IN (...) AND b IN (...)
>


Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

  a | b
 -------
  1 | 1
  2 | 2
  3 | 3
  : | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.

Regards,
Dean

Re: PATCH: add support for IN and @> in functional-dependency statistics use

Reply via email to