Re: [HACKERS] multivariate statistics / patch v7

Tomas Vondra Sat, 25 Jul 2015 14:10:11 -0700

Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:

Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.


Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available ongithub as usual:


    https://github.com/tvondra/postgres/commits/mvstats

The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,

   - clause to be estimated by current logic (MV is not applicable)
   - clause to be estimated by MV-staistics.
   - list of child RestrictStatDatas, which are to be run
     recursively.

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.

What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding howexactly the code works - there are pretty much no comments explainingthe implementation, so it takes time to understand the code. This isespecially true about transformRestrictInfoForEstimate which is alsoquite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into twophases - enrich the expression tree with nodes with details about statsetc, and then actually do the estimation in the second phase might beinteresting. Not because it's somehow clearer, but because it gives us achance to see the expression tree as a whole, with details about all thestats (with the current code we process/estimate the treeincrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. Iknow it's a PoC at this point, so I don't expect it to be perfect, butfor me the original code is certainly clearer. Of course, I'm biased asI wrote the current code, and I (naturally) shaped it to match my ideasduring the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, Ithink. Is that merely to make the PoC simpler, or is there somethingthat makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the codethat combined multiple stats (and selected the best combination ofstats). In other words, you've reverted to the intermediate singlestatistics approach, including removing the improved handling of ORclauses and conditions. It's a bit difficult to judge the proposedapproach not knowing how well it supports those (quite crucial)features. What if it can't support some them., or what if it makes thecode much more complicated (thus defeating the goal of making it moreclear)?

I share your concern about the performance impact - one thing is thatthis new code might be slower than the original one, but a more seriousissue IMHO is that the performance impact will happen even for relationswith no multivariate stats at all. The original patch was very carefulabout getting ~0% overhead in such cases, and if the new code does notallow that, I don't see this approach as acceptable. We must not putadditional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebaseit to the current patch version (as on github) and adding the missingpieces (functional dependencies, multi-statistics estimation and passingconditions)?

One more thing - I noticed you extended the pg_operator catalog with aoprmvstat attribute, used to flag operators that are compatible withmultivariate stats. I'm not happy with the current approach (usingoprrest to do this decision), but I'm not really sure this is a goodsolution either. The culprit is that it only answers one of the twoimportant questions - Is it compatible? How to perform the estimation?

So we'd have to rely on oprrest anyway, when actually performing theestimation of a clause with "compatible" operator. And we'd have to keepin sync two places (catalog and checks in file), and we'd have to updatethe catalog after improving the implementation (adding support foranother operator).



kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics / patch v7

Reply via email to