Re: [HACKERS] multivariate statistics (v19)

Tomas Vondra Mon, 15 Aug 2016 13:51:06 -0700

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
<[email protected]> wrote:

1) enriching the query tree with multivariate statistics info


Right now all the stuff related to multivariate statistics estimation
happens in clausesel.c - matching condition to statistics, selection of
statistics to use (if there are multiple usable stats), etc. So pretty much
all this info is internal to clausesel.c and does not get outside.


This does not seem bad to me as first sight but...

I'm starting to think that some of the steps (matching quals to stats,
selection of stats) should happen in a "preprocess" step before the actual
estimation, storing the information (which stats to use, etc.) in a new type
of node in the query tree - something like RestrictInfo.

I believe this needs to happen sometime after deconstruct_jointree() as that
builds RestrictInfos nodes, and looking at planmain.c, right after
extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c,
separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in EXPLAIN
output. With per-column statistics it's fairly straightforward to determine
which statistics are used and how. But with multivariate stats things are
often more complicated - there may be multiple candidate statistics (e.g.
histograms covering different subsets of the conditions), it's possible to
apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only within
clausesel.c (and thrown away after the estimation).


This gives a good reason to not do that in clauserel.c, it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicatedin practice. It essentially means doing something like


    rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as thename suggests the list should only include RestrictInfo nodes, whichseems to contradict the transformation.


For example with conditions

    WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on(a,b,c), we need to note that somehow - my plan was to inject a nodestoring this information, something like (a bit simplified):


    StatisticsInfo {
         Oid statisticsoid; /* OID of the statistics */
         List *mvconditions; /* estimate using the statistics */
         List *otherconditions; /* estimate the old way */
    }

But that'd clearly violate the assumption that baserestrictinfo onlycontains RestrictInfo. I don't think it's feasible (or desirable) torework all the places to expect both RestrictInfo and the new node.


I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node itwants. But it also means we need to maintain two lists instead of one,and keep them synchronized.


2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e.each node would track whether it's estimated the "old way" or usingmultivariate statistics (and which one). But it would require changingmany of the existing nodes (at least those compatible with multivariatestatistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information duringthe estimation, as it'd be necessary to look for other nodes to beestimated by the same statistics. Which seems to defeat the idea ofpreprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e.keeping a separate list, but I'd welcome other ideas.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v19)

Reply via email to