This patch set is in pretty good shape; the only problem is that it's so big that no one seems to have the time or courage to do the final touches and commit it. If we just focus on the functional dependencies part for now, I think we might get somewhere. I peeked at the MCV and histogram patches too, and I think they make total sense as well, as a natural extension of the functional dependencies patch. So if we just focus on that for now, I don't think we will paint ourselves into a corner.

(more below)

On 09/14/2016 01:01 AM, Tomas Vondra wrote:
> On 09/12/2016 04:08 PM, Dean Rasheed wrote:
>> Regarding the statistics themselves, I read the description of soft
>> functional dependencies, and I'm somewhat skeptical about that
>> algorithm. I don't like the arbitrary thresholds or the sudden jump
>> from independence to dependence and clause reduction. As others have
>> said, I think this should account for a continuous spectrum of
>> dependence from fully independent to fully dependent, and combine
>> clause selectivities in a way based on the degree of dependence. For
>> example, if you computed an estimate for the fraction 'f' of the
>> table's rows for which a -> b, then it might be reasonable to combine
>> the selectivities using
>>
>>   P(a,b) = P(a) * (f + (1-f) * P(b))


> Yeah, I agree that the thresholds resulting in sudden changes between
> "dependent" and "not dependent" are annoying. The question is whether it
> makes sense to fix that, though - the functional dependencies were meant
> as the simplest form of statistics, allowing us to get the rest of the
> infrastructure in.
>
> I'm OK with replacing the true/false dependencies with a degree of
> dependency between 0 and 1, but I'm a bit afraid it'll result in
> complaints that the first patch got too large / complicated.

+1 for using a floating degree between 0 and 1, rather than a boolean.
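
To make the behaviour concrete, here's a hypothetical worked example of Dean's formula, with made-up selectivities P(a) = P(b) = 0.1 (the numbers are mine, purely for illustration):

  SELECT f, round(0.1 * (f + (1 - f) * 0.1), 4) AS combined_sel
  FROM unnest(ARRAY[0.0, 0.5, 1.0]) AS f;
  -- f = 0.0 -> 0.0100  (plain product, fully independent)
  -- f = 0.5 -> 0.0550  (halfway in between)
  -- f = 1.0 -> 0.1000  (= P(a); the clause on b is fully redundant)

Note that f = 1 reduces to exactly the clause reduction we do today, so the current boolean behaviour falls out as a special case.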

> It also contradicts the idea of using functional dependencies as a
> low-overhead type of statistics, filtering the list of clauses that need
> to be estimated using more expensive types of statistics (MCV lists,
> histograms, ...). Switching to a degree of dependency would prevent
> removal of "unnecessary" clauses.

That sounds OK to me, although I'm not deeply familiar with this patch yet.

>> Of course, having just a single number that tells you the columns are
>> correlated, tells you nothing about whether the clauses on those
>> columns are consistent with that correlation. For example, in the
>> following table
>>
>> CREATE TABLE t(a int, b int);
>> INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);
>>
>> 'b' is functionally dependent on 'a' (and vice versa), but if you
>> query the rows with a<50 and with b<50, those clauses behave
>> essentially independently, because they're not consistent with the
>> functional dependence between 'a' and 'b', so the best way to combine
>> their selectivities is just to multiply them, as we currently do.
>>
>> So whilst it may be interesting to determine that 'b' is functionally
>> dependent on 'a', it's not obvious whether that fact by itself should
>> be used in the selectivity estimates. Perhaps it should, on the
>> grounds that it's best to attempt to use all the available
>> information, but only if there are no more detailed statistics
>> available. In any case, knowing that there is a correlation can be
>> used as an indicator that it may be worthwhile to build more detailed
>> multivariate statistics like a MCV list or a histogram on those
>> columns.

> Right. IIRC this is actually described in the README as "incompatible
> conditions". While implementing it, I concluded that this is OK and it's
> up to the developer to decide whether the queries are compatible with
> the "assumption of compatibility". But maybe this reasoning is bogus
> and makes (the current implementation of) functional dependencies
> unusable in practice.

I think that's OK. It seems like a good assumption that the conditions are "compatible" with the functional dependency, for two reasons:

1) A query with compatible clauses is much more likely to occur in real life. Why would you run a query with incompatible ZIP and city clauses?

2) If the conditions were in fact incompatible, the query is likely to return 0 rows, and will bail out very quickly, even if the estimates are way off and you choose a non-optimal plan. There are exceptions, of course: an index scan might be able to conclude that there are no rows much quicker than a seqscan, but as a general rule of thumb, a query that returns 0 rows isn't very sensitive to the chosen plan.

And of course, as long as we're not collecting these statistics automatically, if it doesn't work for your application, just don't collect them.
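
To put rough numbers on the incompatible case, using Dean's example table above (the counts follow from the table's construction; I haven't measured anything):

  SELECT count(*) FROM t WHERE a < 50;             -- 500 of 1000 rows
  SELECT count(*) FROM t WHERE b < 50;             -- 500 of 1000 rows
  SELECT count(*) FROM t WHERE a < 50 AND b < 50;  -- ~250 rows, matching what
                                                   -- the independence
                                                   -- assumption predicts

With f = 1, the degree-based formula would estimate P(a) = 0.5, i.e. ~500 rows, about twice the true count. That's Dean's point: the degree alone doesn't tell you whether the clauses actually follow the dependency, so incompatible queries will get overestimated.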


I fear that using "statistics" as the name of the new object might get a bit awkward. "statistics" is a plural, but we use it as the name of a single object, like "pants" or "scissors". Not sure I have any better ideas though. "estimator"? "statistics collection"? Or perhaps it should be singular, "statistic". I note that you actually called the system table "pg_mv_statistic", in singular.

I'm not a big fan of storing the stats as just a bytea blob, and having to use special functions to interpret it. By looking at the patch, it's not clear to me what we actually store for functional dependencies. A list of attribute numbers? Could we store them simply as an int[]? (I'm not a big fan of the hack in pg_statistic, that allows storing arrays of any data type in the same column, though. But for functional dependencies, I don't think we need that.)
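
Just to sketch what I mean (entirely hypothetical - the staattnums column name below is made up, not what the patch uses): if the attribute numbers were stored as a plain array, you could inspect the dependencies with ordinary SQL:

  -- hypothetical: attribute numbers as int2[] instead of an opaque bytea
  SELECT starelid, staattnums FROM pg_mv_statistic;
  -- a staattnums of '{1,2}' would read naturally as "attnum 1 determines attnum 2"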

Overall, this is going to be a great feature!

- Heikki


