Hello Horiguchi-san,

On 07/27/2015 09:04 AM, Kyotaro HORIGUCHI wrote:
Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote 
in <55b3fb0b.7000...@2ndquadrant.com>
Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
Hi, I'd like to show you the modified structure of the
multivariate statistics application logic. Please find them
attached; they apply on top of your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

Ah. My modification to rebase it onto current master is probably
the culprit. Sorry for the dirty patch.

# I would have recreated the patch if you had complained before
# struggling with it..

The core of the modification is in clausesel.c. I attached the
patched clausesel.c.

I don't see any attachment. Perhaps you forgot to actually attach it?


My concerns about the code at the time were the following:

- You embedded the logic of multivariate estimation into
   clauselist_selectivity. I think estimation using multivariate
   statistics is quite different from the ordinary estimation based
   on single-column stats, so the two are logically separable and
   we should separate them.

I don't see them as very different - actually quite the opposite. The two kinds of statistics are complementary and should naturally coexist. Perhaps the current code is not perfect and refactoring would make it more readable, but I don't think its primary aim should be to separate regular and multivariate stats.


- You are taking a top-down approach that runs a tree walk to
   check the applicability of mv-stats at every step down the
   clause tree. If a subtree is found to be mv-applicable, it is
   split into two parts - mv-compatible and non-compatible. These
   steps require expression tree walking, which looks like it uses
   too much CPU.

I'm taking the top-down approach because that's what the regular stats do, and also because it's what allows implementing the features that I think are interesting - the ability to combine multiple stats in an efficient way, to pass conditions, and so on. I think those two features are very useful and enable very interesting things.

The bottom-up approach would probably work too - I mean, we could start from the leaves of the expression tree, build the largest "subtree" compatible with multivariate stats, and then try to estimate it. I don't see how we could pass conditions that way, though, which works naturally in the top-down approach.

Or maybe a combination of both - identify the "compatible" subtrees first, then perform the top-down phase.
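
To make this more concrete, here is a greatly simplified sketch of the split I have in mind - clause_is_mv_compatible() and clauselist_mv_selectivity() are just placeholders for the real (and more complicated) logic:

    /*
     * Greatly simplified sketch of the top-down split - not the actual
     * patch code; the two helpers are placeholders.
     */
    static Selectivity
    clauselist_selectivity_sketch(PlannerInfo *root, List *clauses,
                                  int varRelid, JoinType jointype,
                                  SpecialJoinInfo *sjinfo)
    {
        List       *mvclauses = NIL;        /* estimated using MV stats */
        List       *otherclauses = NIL;     /* estimated clause by clause */
        ListCell   *lc;

        foreach(lc, clauses)
        {
            Node       *clause = (Node *) lfirst(lc);

            /* placeholder check: single relation, supported operators, ... */
            if (clause_is_mv_compatible(root, clause, varRelid))
                mvclauses = lappend(mvclauses, clause);
            else
                otherclauses = lappend(otherclauses, clause);
        }

        /*
         * Estimate the MV-compatible part as a whole, the rest using the
         * regular per-clause logic, and combine the two (assuming rough
         * independence between the two groups).
         */
        return clauselist_mv_selectivity(root, mvclauses) *
               clauselist_selectivity(root, otherclauses, varRelid,
                                      jointype, sjinfo);
    }

The interesting part - and what the top-down walk buys us - is that when recursing into AND/OR subtrees we can pass the clauses already used for estimation down as conditions.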

- You seem to be considering cases where users create many
   multivariate statistics on overlapping attribute sets. But that
   looks like too much to me. MV-stats are more resource-hungry, so
   we can assume they will be used sparingly.

Not really. I don't expect huge numbers of multivariate stats to be built on the tables.

But I think restricting users to a single multivariate statistics per table would be a significant limitation. And once you allow using multiple multivariate statistics for a set of clauses, supporting overlapping stats is not that difficult.

What it does make possible, however, is combining multiple "small" stats into a larger one in a very efficient way - assuming the overlap is sufficient, of course. But if that's true, you may build multiple small (and very accurate) stats instead of one huge (or very inaccurate) statistics.

This also makes it possible to handle complex combinations of clauses that are compatible and incompatible with multivariate statistics, by passing the conditions.
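
As a trivial example of what I mean (my notation, not code from the patch): with one statistics on (a,b) and another on (b,c), sharing column b, a condition touching all three columns can be estimated roughly as

    P(a,b,c) ~ P(a,b) * P(c | b)

where the conditional part comes from the second statistics, restricted by the clauses on the shared column b.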


My suggestion in the patch is a bottom-up approach to finding the
mv-applicable portion(s) of the expression tree, which is the
basic way the planner works overall. The approach requires no
repetitive runs of the tree walker, that is, pull_varnos. It could
fail to find the 'optimal' solution in complex situations, but it
needs far less computation for almost the same result (I think..).

Even though it doesn't consider functional dependencies, the
reduction in code shows the efficiency. It does nothing tricky.

OK
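
To make sure I'm reading the bottom-up idea correctly, this is roughly how I understand it (the helper names are mine, not from your patch):

    /*
     * Each leaf clause is inspected exactly once (a single pull_varnos()
     * call per clause); clauses referencing a single relation become
     * candidates for MV estimation, grouped by that relation.
     */
    static void
    group_mv_candidates(List *clauses, List **otherclauses)
    {
        ListCell   *lc;

        foreach(lc, clauses)
        {
            Node       *clause = (Node *) lfirst(lc);
            Relids      varnos = pull_varnos(clause);  /* one walk per clause */

            if (bms_membership(varnos) == BMS_SINGLETON)
            {
                int         relid = bms_singleton_member(varnos);

                /* candidate for MV estimation - group it by relation */
                add_to_mv_group(relid, clause);     /* hypothetical helper */
            }
            else
            {
                /* join clauses etc. keep using the regular estimation */
                *otherclauses = lappend(*otherclauses, clause);
            }
        }
    }

Is that about right?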

On a conceptual level, I think the idea of splitting the estimation
into two phases - first enriching the expression tree with nodes
carrying details about the stats etc., and then actually doing the
estimation in the second phase - might be interesting. Not because
it's somehow clearer, but because it gives us a chance to see the
expression tree as a whole, with details about all the stats (with the
current code we process/estimate the tree incrementally). But I don't
really know how useful that would be.

It is difficult to say which approach is better, since that depends
on which things we consider more important than others. However, I
am concerned that your code substantially reconstructs the
expression (clause) tree in the midst of processing it. I believe
that should be a separate phase, for simplicity. Of course the
additional resources required should also be considered, but they
are rather reduced in this case.

What do you mean by "reconstruct the expression tree"? It's true I'm walking the expression tree top-down, but how is that reconstructing?


I don't think the proposed change makes the process any clearer. I
know it's a PoC at this point, so I don't expect it to be perfect, but
for me the original code is certainly clearer. Of course, I'm biased,
as I wrote the current code and (naturally) shaped it to match my
ideas during development, so I'm much more familiar with it.

Mmm, we need someone else's opinion :) What I think on this point
is described just above... OK, I'll try to describe it in other
words.

I find your comments very valuable. I may not agree with some of them, but I certainly appreciate your point of view. So thank you very much for the time you spent reviewing this patch so far!

The embedded approach increases the state and code paths roughly
multiplicatively. The separate approach adds to them only additively.
I think this is the most significant reason why I feel it is
'clearer'.

Of course, the acceptable complexity differs according to the
fundamental complexity, performance, required memory or other
factors, but I feel this is too much complexity for the objective.

Yes, I think we might have slightly different objectives in mind.

Regarding the complexity - I am not too worried about spending more CPU cycles on this, as long as it does not impact the case where people have no multivariate statistics at all. That's because I expect people to use this for large DSS/DWH data sets with lots of dependencies in the (often denormalized) tables and complex conditions - in those cases the planning difference is negligible, especially if the improved estimates make the query run in seconds instead of hours.

This is why I was so careful to entirely skip the expensive processing when there were no multivariate stats, and why I don't like the fact that your approach makes this skip more difficult (or maybe impossible, I'm not sure).

It's also true that most OLTP queries (especially the short ones, thus most impacted by the increase of planning time) use rather short/simple clause lists, so even the top-down approach should be very cheap.
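
Just to illustrate what I mean by "skip" - roughly this shape, where find_mv_statistics() is a placeholder for however the stats end up being looked up:

    List   *stats = find_mv_statistics(root, varRelid);    /* placeholder */

    if (stats == NIL)
    {
        /* no MV stats - fall back to plain per-clause estimation only */
        Selectivity s1 = 1.0;
        ListCell   *lc;

        foreach(lc, clauses)
            s1 *= clause_selectivity(root, (Node *) lfirst(lc),
                                     varRelid, jointype, sjinfo);
        return s1;
    }

    /* only from this point on do we pay for the MV-specific processing */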


Omitting the support for functional dependencies is a bit unfortunate,
I think. Is that merely to make the PoC simpler, or is there something
that makes it impossible to support that kind of stats?

I don't think so. I omitted it simply because it would take more
time to implement.

OK, thanks for confirming this.


Another thing that I noticed is that you completely removed the code
that combined multiple stats (and selected the best combination of
stats). In other words, you've reverted to the intermediate single
statistics approach, including removing the improved handling of OR
clauses and conditions.

Yeah, good catch :p I noticed just after submitting the patch that I
retain only one statistics at the second level from the bottom, but
it is easily fixed by changing the pruning timing. The struct can
hold multiple statistics anyway.

Great!


And I didn't omit the OR case. It is handled along with the AND
case. (In a wrong way?)

Oh, I see. I got a bit confused because you've removed the optimization step (and conditions), and that needs to be handled a bit differently for the OR clauses.


It's a bit difficult to judge the proposed
approach not knowing how well it supports those (quite crucial)
features. What if it can't support some of them, or what if it makes
the code much more complicated (thus defeating the goal of making it
clearer)?

OR is supported, and functional dependencies are probably
supportable, but all of this happens within the function with the
entangled name (transform..something). But I should look at your
latest code more closely before that.

Good. Likewise, I'd like to see more of your approach ;-)


I share your concern about the performance impact - one thing is that
this new code might be slower than the original one, but a more
serious issue IMHO is that the performance impact will happen even for
relations with no multivariate stats at all. The original patch was
very careful about getting ~0% overhead in such cases,

I don't think so. find_stats runs pull_varnos, and
transformRestric.. also uses pull_varnos to bail out at the top
level. They should have almost the same overhead in that case.

Understood. As I explained above, I'm not all that concerned about the performance impact, as long as we make sure it only applies to people using the multivariate stats.

I also think a combined approach - first a bottom-up step (identifying the largest compatible subtrees & caching the varnos), then a top-down step (doing the same optimization as implemented today) might minimize the performance impact.
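
Something along these lines, perhaps - a per-clause annotation built by a single bottom-up pass and then consumed by the existing top-down optimization (illustrative only, none of these names exist in the patch):

    typedef struct MVClauseInfo
    {
        Node       *clause;         /* the original clause */
        Relids      varnos;         /* cached result of pull_varnos(clause) */
        Bitmapset  *attnums;        /* attributes referenced (single-rel case) */
        bool        mv_compatible;  /* estimable by multivariate stats? */
    } MVClauseInfo;

That way pull_varnos() runs once per clause, no matter how deep the top-down phase recurses.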


and if the new
code does not allow that, I don't see this approach as acceptable. We
must not put additional overhead on people not using multivariate
stats.

But I think it's worth exploring this idea a bit more - can you rebase
it onto the current patch version (as on github) and add the missing
pieces (functional dependencies, multi-statistics estimation and
passing conditions)?

With pleasure. Please wait for a while.

Sure. Take your time.


One more thing - I noticed you extended the pg_operator catalog with an
oprmvstat attribute, used to flag operators that are compatible with
multivariate stats. I'm not happy with the current approach (using
oprrest to make this decision), but I'm not really sure this is a good
solution either. The trouble is that it only answers one of the two
important questions - is it compatible, and how do we perform the
estimation?

Honestly speaking, I also don't like this. But checking oprrest is
just as unpleasant.

The patch is already quite massive, so let's use the same approach as current stats, and leave this problem for another patch. If we come up with a great idea, we can work on it, but I see this as a loosely related annoyance rather than something this patch aims to address.

So we'd have to rely on oprrest anyway, when actually performing the
estimation of a clause with a "compatible" operator. And we'd have to
keep two places in sync (the catalog and the checks in the code), and
we'd have to update the catalog after improving the implementation
(adding support for another operator).

Mmm. It depends on what the developers think about the definition
of oprrest. More practically, I'm worried about whether it can ever
be anything other than eqsel for an equality operator. And the same
for comparison operators.

OTOH if you define a new operator with oprrest=F_EQSEL, you're effectively saying "It's OK to estimate this using regular eq/lt/gt operators". If your operator is somehow incompatible with that, you should not set oprrest=F_EQSEL.
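
For illustration, the check I have in mind is roughly the following (simplified, of course - the actual patch may handle more cases):

    #include "postgres.h"
    #include "utils/fmgroids.h"     /* F_EQSEL, F_SCALARLTSEL, F_SCALARGTSEL */
    #include "utils/lsyscache.h"    /* get_oprrest() */

    /*
     * An operator is considered compatible with multivariate statistics
     * iff its restriction estimator is one of the standard ones.
     */
    static bool
    mv_compatible_operator(Oid opno)
    {
        switch (get_oprrest(opno))
        {
            case F_EQSEL:           /* behaves like "=" */
            case F_SCALARLTSEL:     /* behaves like "<" or "<=" */
            case F_SCALARGTSEL:     /* behaves like ">" or ">=" */
                return true;

            default:
                return false;
        }
    }

An operator that sets one of those estimators but does not actually behave like equality/comparison would indeed be mis-estimated - but that is already true for the regular per-column statistics.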

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

