On 08/10/2016 06:41 AM, Michael Paquier wrote:
On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
1) enriching the query tree with multivariate statistics info
Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty much
all this info is internal to clausesel.c and does not get outside.
This does not seem bad to me as first sight but...
I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the actual
estimation, storing the information (which stats to use, etc.) in a new type
of node in the query tree - something like RestrictInfo.
I believe this needs to happen sometime after deconstruct_jointree() as that
builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.
This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.
But more importantly, I think we'll need to show some of the data in EXPLAIN
output. With per-column statistics it's fairly straightforward to determine
which statistics are used and how. But with multivariate stats things are
often more complicated - there may be multiple candidate statistics (e.g.
histograms covering different subsets of the conditions), it's possible to
apply them in different orders, etc.
But EXPLAIN can't show the info if it's ephemeral and available only within
clausesel.c (and thrown away after the estimation).
This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.
I've been thinking about this, and I'm afraid it's way more complicated
in practice. It essentially means doing something like
rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);
for each table (and in the future maybe also for joins etc.) But as the
name suggests the list should only include RestrictInfo nodes, which
seems to contradict the transformation.
For example with conditions
WHERE (a=1) AND (b=2) AND (c=3)
the list will contain 3 RestrictInfos. But if there's a statistics on
(a,b,c), we need to note that somehow - my plan was to inject a node
storing this information, something like (a bit simplified):
Oid statisticsoid; /* OID of the statistics */
List *mvconditions; /* estimate using the statistics */
List *otherconditions; /* estimate the old way */
But that'd clearly violate the assumption that baserestrictinfo only
contains RestrictInfo. I don't think it's feasible (or desirable) to
rework all the places to expect both RestrictInfo and the new node.
I can think of two alternatives:
1) keep the transformed list as separate list, next to baserestrictinfo
This obviously fixes the issue, as each caller can decide which node it
wants. But it also means we need to maintain two lists instead of one,
and keep them synchronized.
2) embed the information into the existing tree
It might be possible to store the information in existing nodes, i.e.
each node would track whether it's estimated the "old way" or using
multivariate statistics (and which one). But it would require changing
many of the existing nodes (at least those compatible with multivariate
statistics: currently OpExpr, NullTest, ...).
And it also seems fairly difficult to reconstruct the information during
the estimation, as it'd be necessary to look for other nodes to be
estimated by the same statistics. Which seems to defeat the idea of
preprocessing to some degree.
So I'm not sure what's the best solution. I'm leaning to (1), i.e.
keeping a separate list, but I'd welcome other ideas.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: