# Re: [HACKERS] multivariate statistics (v19)

This patch set is in pretty good shape, the only problem is that it's so big that no-one seems to have the time or courage to do the final touches and commit it. If we just focus on the functional dependencies part for now, I think we might get somewhere. I peeked at the MCV and histogram patches too, and I think they make total sense as well, and are a natural extension of the functional dependencies patch. So if we just focus on that for now, I don't think we will paint ourselves in the corner.
```
(more below)```
```
On 09/14/2016 01:01 AM, Tomas Vondra wrote:
```
```On 09/12/2016 04:08 PM, Dean Rasheed wrote:
```
```Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

P(a,b) = P(a) * (f + (1-f) * P(b))

```
```
Yeah, I agree that the thresholds resulting in sudden changes between
"dependent" and "not dependent" are annoying. The question is whether it
makes sense to fix that, though - the functional dependencies were meant
as the simplest form of statistics, allowing us to get the rest of the
infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of
dependency between 0 and 1, but I'm a bit afraid it'll result in
complaints that the first patch got too large / complicated.
```
```
+1 for using a floating degree between 0 and 1, rather than a boolean.

```
```It also contradicts the idea of using functional dependencies as a
low-overhead type of statistics, filtering the list of clauses that need
to be estimated using more expensive types of statistics (MCV lists,
histograms, ...). Switching to a degree of dependency would prevent
removal of "unnecessary" clauses.
```
```
That sounds OK to me, although I'm not deeply familiar with this patch yet.

```
```Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.
```
```
Right. IIRC this is actually described in the README as "incompatible
conditions". While implementing it, I concluded that this is OK and it's
up to the developer to decide whether the queries are compatible with
the "assumption of compatibility". But maybe this is reasoning is bogus
and makes (the current implementation of) functional dependencies
unusable in practice.
```
```
```
I think that's OK. It seems like a good assumption that the conditions are "compatible" with the functional dependency. For two reasons:
```
```
1) A query with compatible clauses is much more likely to occur in real life. Why would you run a query with an incompatible ZIP and city clauses?
```
```
2) If the conditions were in fact incompatible, the query is likely to return 0 rows, and will bail out very quickly, even if the estimates are way off and you choose a non-optimal plan. There are exceptions, of course: an index scan might be able to conclude that there are no rows much quicker than a seqscan, but as a general rule of thumb, a query that returns 0 rows isn't very sensitive to the chosen plan.
```
```
And of course, as long as we're not collecting these statistics automatically, if it doesn't work for your application, just don't collect them.
```

```
I fear that using "statistics" as the name of the new object might get a bit awkward. "statistics" is a plural, but we use it as the name of a single object, like "pants" or "scissors". Not sure I have any better ideas though. "estimator"? "statistics collection"? Or perhaps it should be singular, "statistic". I note that you actually called the system table "pg_mv_statistic", in singular.
```
```
I'm not a big fan of storing the stats as just a bytea blob, and having to use special functions to interpret it. By looking at the patch, it's not clear to me what we actually store for functional dependencies. A list of attribute numbers? Could we store them simply as an int[]? (I'm not a big fan of the hack in pg_statistic, that allows storing arrays of any data type in the same column, though. But for functional dependencies, I don't think we need that.)
```
Overall, this is going to be a great feature!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
```