Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on github as usual:

    https://github.com/tvondra/postgres/commits/mvstats

The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,

   - clause to be estimated by current logic (MV is not applicable)
   - clause to be estimated by MV-staistics.
   - list of child RestrictStatDatas, which are to be run
     recursively.

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.

What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding how exactly the code works - there are pretty much no comments explaining the implementation, so it takes time to understand the code. This is especially true about transformRestrictInfoForEstimate which is also quite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into two phases - enrich the expression tree with nodes with details about stats etc, and then actually do the estimation in the second phase might be interesting. Not because it's somehow clearer, but because it gives us a chance to see the expression tree as a whole, with details about all the stats (with the current code we process/estimate the tree incrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. I know it's a PoC at this point, so I don't expect it to be perfect, but for me the original code is certainly clearer. Of course, I'm biased as I wrote the current code, and I (naturally) shaped it to match my ideas during the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, I think. Is that merely to make the PoC simpler, or is there something that makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the code that combined multiple stats (and selected the best combination of stats). In other words, you've reverted to the intermediate single statistics approach, including removing the improved handling of OR clauses and conditions. It's a bit difficult to judge the proposed approach not knowing how well it supports those (quite crucial) features. What if it can't support some them., or what if it makes the code much more complicated (thus defeating the goal of making it more clear)?

I share your concern about the performance impact - one thing is that this new code might be slower than the original one, but a more serious issue IMHO is that the performance impact will happen even for relations with no multivariate stats at all. The original patch was very careful about getting ~0% overhead in such cases, and if the new code does not allow that, I don't see this approach as acceptable. We must not put additional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebase it to the current patch version (as on github) and adding the missing pieces (functional dependencies, multi-statistics estimation and passing conditions)?

One more thing - I noticed you extended the pg_operator catalog with a oprmvstat attribute, used to flag operators that are compatible with multivariate stats. I'm not happy with the current approach (using oprrest to do this decision), but I'm not really sure this is a good solution either. The culprit is that it only answers one of the two important questions - Is it compatible? How to perform the estimation?

So we'd have to rely on oprrest anyway, when actually performing the estimation of a clause with "compatible" operator. And we'd have to keep in sync two places (catalog and checks in file), and we'd have to update the catalog after improving the implementation (adding support for another operator).


kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to