Hi,

Thanks for looking into this!

On 09/12/2016 04:08 PM, Dean Rasheed wrote:
On 3 August 2016 at 02:58, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read
much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS
makes sense, and I agree with other suggestions upthread that it would
be useful to be able to build statistics on arbitrary expressions,
although that doesn't need to be part of this patch; it's useful to
keep that in mind as a possible future extension of this initial
design.

I can imagine it being useful to be able to create user-defined
statistics on an arbitrary list of expressions, and I think that would
include univariate as well as multivariate statistics. Perhaps that's
something to take into account in the naming of things, e.g., as David
Rowley suggested, something like pg_statistic_ext, rather than
pg_mv_statistic.

I also like the idea that this might one day be extended to support
statistics across multiple tables, although I think that might be
challenging to achieve -- you'd need a method of taking a random
sample of rows from a join between 2 or more tables. However, if the
intention is to be able to support that one day, I think that needs to
be accounted for in the syntax now -- specifically, I think it will be
too limiting to only support things extending the current syntax of
the form table1(col1, col2, ...), table2(col1, col2, ...), because
that precludes building statistics on an expression referring to
columns from more than one table. So I think we should plan further
ahead and use a syntax giving greater flexibility in the future, for
example something structured more like a query (like CREATE VIEW):

CREATE STATISTICS name
  [ WITH (options) ]
  ON expression [, ...]
  FROM table [, ...]
  WHERE condition

where the first version of the patch would only support expressions
that are simple column references, and would require at least 2 such
columns from a single table with no WHERE clause, i.e.:

CREATE STATISTICS name
  [ WITH (options) ]
  ON column1, column2 [, ...]
  FROM table

For multi-table statistics, a WHERE clause would typically be needed
to specify how the tables are expected to be joined, but potentially
such a clause might also be useful in single-table statistics, to
build partial statistics on a commonly queried subset of the table,
just like a partial index.

Hmm, the "partial statistics" idea seems interesting, It would allow us to provide additional / more detailed statistics only for a subset of a table.

However, I'm not sure about the join case - how would the syntax work with outer joins? But as you said, we only need

 CREATE STATISTICS name
   [ WITH (options) ]
   ON (column1, column2 [, ...])
   FROM table
   WHERE condition

until we add support for join statistics.
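
Just to make that more concrete, a partial statistic under this syntax might look something like the following (the names are made up, and the WHERE part is of course a future extension):

  CREATE STATISTICS orders_active_stats
    ON (customer_id, shipping_region)
    FROM orders
    WHERE status = 'active';

i.e. pretty much the same thing as a partial index, except that we'd collect statistics on the matching subset of rows instead of indexing it.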


Regarding the statistics themselves, I read the description of soft
functional dependencies, and I'm somewhat skeptical about that
algorithm. I don't like the arbitrary thresholds or the sudden jump
from independence to dependence and clause reduction. As others have
said, I think this should account for a continuous spectrum of
dependence from fully independent to fully dependent, and combine
clause selectivities in a way based on the degree of dependence. For
example, if you computed an estimate for the fraction 'f' of the
table's rows for which a -> b, then it might be reasonable to combine
the selectivities using

  P(a,b) = P(a) * (f + (1-f) * P(b))
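
For illustration, with P(a) = 0.1 and P(b) = 0.01 (numbers picked arbitrarily), that formula degrades gracefully between the two extremes:

  f = 0.0 (independent):     0.1 * (0.0 + 1.0 * 0.01) = 0.001
  f = 0.5:                   0.1 * (0.5 + 0.5 * 0.01) = 0.0505
  f = 1.0 (fully dependent): 0.1 * (1.0 + 0.0 * 0.01) = 0.1

so f = 0 reproduces the current independence estimate and f = 1 collapses to P(a).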


Yeah, I agree that the thresholds resulting in sudden changes between "dependent" and "not dependent" are annoying. The question is whether it makes sense to fix that, though - the functional dependencies were meant as the simplest form of statistics, allowing us to get the rest of the infrastructure in.

I'm OK with replacing the true/false dependencies with a degree of dependency between 0 and 1, but I'm a bit afraid it'll result in complaints that the first patch got too large / complicated.

It also contradicts the idea of using functional dependencies as a low-overhead type of statistics - a cheap way to filter out clauses before estimating the rest using the more expensive types of statistics (MCV lists, histograms, ...). Switching to a degree of dependency would prevent that removal of "unnecessary" clauses.

Of course, having just a single number that tells you the columns are
correlated, tells you nothing about whether the clauses on those
columns are consistent with that correlation. For example, in the
following table

CREATE TABLE t(a int, b int);
INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you
query the rows with a<50 and with b<50, those clauses behave
essentially independently, because they're not consistent with the
functional dependence between 'a' and 'b', so the best way to combine
their selectivities is just to multiply them, as we currently do.
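
That's easy to verify directly on the example data, for instance with:

  SELECT count(*) FILTER (WHERE a < 50)            AS rows_a,
         count(*) FILTER (WHERE b < 50)            AS rows_b,
         count(*) FILTER (WHERE a < 50 AND b < 50) AS rows_both
  FROM t;

which gives 500 and 500 rows for the individual clauses and 260 for the conjunction - very close to the 250 rows predicted by simply multiplying the two selectivities.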

So whilst it may be interesting to determine that 'b' is functionally
dependent on 'a', it's not obvious whether that fact by itself should
be used in the selectivity estimates. Perhaps it should, on the
grounds that it's best to attempt to use all the available
information, but only if there are no more detailed statistics
available. In any case, knowing that there is a correlation can be
used as an indicator that it may be worthwhile to build more detailed
multivariate statistics like a MCV list or a histogram on those
columns.


Right. IIRC this is actually described in the README as "incompatible conditions". While implementing it, I concluded that this is OK and that it's up to the developer to decide whether the queries are compatible with the "assumption of compatibility". But maybe that reasoning is bogus and makes (the current implementation of) functional dependencies unusable in practice.

But I like the idea of reversing the order from

(a) look for functional dependencies
(b) reduce the clauses using functional dependencies
(c) estimate the rest using multivariate MCV/histograms

to

(a) estimate the clauses using multivariate MCV/histograms
(b) try to apply functional dependencies to the remaining clauses

It contradicts the idea of functional dependencies as "low-overhead statistics" but maybe it's worth it.


Looking at the ndistinct coefficient 'q', I think it would be better
if the recorded statistic were just the estimate for
ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a
more fundamental statistic, and it's easier to document and easier to
interpret. Also, I don't believe that the coefficient 'q' is the right
number to use for clause estimation:


IIRC the reason why I stored the coefficient instead of the raw ndistinct() values is that the coefficient is not directly tied to the number of rows in the original relation, so you can apply it directly to whatever cardinality estimate you have.

Otherwise it's mostly the same information - it's trivial to compute one from the other.
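
IIRC the coefficient is defined simply as

  q = (ndistinct(a) * ndistinct(b) * ...) / ndistinct(a, b, ...)

so given the per-column ndistinct estimates (which we already have), converting between the coefficient and the raw ndistinct(a,b,...) estimate is a single multiplication or division.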

Looking at README.ndistinct, I'm skeptical about the selectivity
estimation argument. In the case where a -> b, you'd have q =
ndistinct(b), so then P(a=1 & b=2) would become 1/ndistinct(a), which
is fine for a uniform distribution. But typically, there would be
univariate statistics on a and b, so if for example a=1 were 100x more
likely than average, you'd probably know that and the existing code
computing P(a=1) would reflect that, whereas simply using P(a=1 & b=2)
= 1/ndistinct(a) would be a significant underestimate, since it would
be ignoring known information about the distribution of a.

But likewise if, as is later argued, you were to use 'q' as a
correction factor applied to the individual clause selectivities, you
could end up with significant overestimates: if you said P(a=1 & b=2)
= q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and
a -> b, then b=2 would also be 100x more likely than average (assuming
that b=2 was the value implied by the functional dependency), and that
would also be reflected in the univariate statistics on b, so then you'd
end up with an overall selectivity of around 10000/ndistinct(a), which
would be 100x too big. In fact, since a -> b means that q =
ndistinct(b), there's a good chance of hitting data for which q * P(b)
is greater than 1, so this formula would lead to a combined
selectivity greater than P(a), which is obviously nonsense.
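
To put concrete (made-up) numbers on that: say ndistinct(a) = 1,000,000, ndistinct(b) = 1,000 and a -> b, so q = ndistinct(b) = 1,000. If a=1 matches 0.01% of the rows (100x its average frequency), then the implied b=2 is also about 100x more common than average, i.e. P(b=2) = 0.1, and the proposed formula gives

  q * P(a=1) * P(b=2) = 1000 * 0.0001 * 0.1 = 0.01

while the true selectivity is just P(a=1) = 0.0001, because b=2 is implied by a=1 - a 100x overestimate, matching the 10000/ndistinct(a) figure above.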

Well, yeah. The

    P(a=1) = 1/ndistinct(a)

was really just a simplification for the uniform distribution, and looking at "q" as a correction factor is much more practical - no doubt about that.

As for the overestimates and underestimates - I don't think we can entirely prevent those. We're essentially replacing one assumption (AVIA, i.e. attribute value independence) with other assumptions (homogeneity for ndistinct, compatibility for functional dependencies), hoping that the new assumptions are weaker in some sense. But there will always be cases that break them.

Unlike the functional dependencies, this "homogeneity" assumption does not depend on the queries at all, so it should be possible to verify it during ANALYZE.

Also, maybe we could/should use the same approach as for functional dependencies, i.e. try using the more detailed statistics first and then apply the ndistinct coefficients only to the remaining clauses?


Having a better estimate for ndistinct(a,b,...) looks very useful by
itself for GROUP BY estimation, and there may be other places that
would benefit from it, but I don't think it's the best statistic for
determining functional dependence or combining clause selectivities.
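
For example, on the table from the earlier example, the current code assumes independence when estimating the number of groups for something like

  EXPLAIN SELECT a, b FROM t GROUP BY a, b;

so it ends up with roughly ndistinct(a) * ndistinct(b) groups (clamped to the row count) - around 1000 here - while the actual number of groups is only 100, thanks to the functional dependency.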


Not sure. I think it may be a very useful type of statistics, but I'm not going to fight for it very hard. I'm fine with ignoring this statistics type for now, getting the other "detailed" statistics types (MCV, histograms) in, and then revisiting this.

That's as much as I've looked at so far. It's such a big patch that
it's difficult to consider all at once. I think perhaps the smallest
committable self-contained unit providing a tangible benefit would be
something containing the core infrastructure plus the ndistinct
estimate and the improved GROUP BY estimation.


FWIW I find the ndistinct statistics rather uninteresting (at least compared to the other types of statistics), which is why it's the last patch in the patch series. Perhaps I shouldn't have included it at all, as it's just a distraction.


regards
Dean

