Re: [HACKERS] multivariate statistics (v19)

Petr Jelinek Wed, 10 Aug 2016 04:51:42 -0700

On 10/08/16 13:33, Tomas Vondra wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

2) combining multiple statistics


I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I
haven't written the SGML docs about this part of the patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

   SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does
about this:

   P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.


This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?


I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the
statistics and accuracy of the statistics. Let's say you have statistics
on 8 columns, and you split each dimension twice to build a histogram -
that's 256 buckets right there, and we only get ~50% selectivity in each
dimension (the actual histogram building algorithm is more complex, but
you get the idea).

I think it makes sense to pursue this, but I also think we can easilylive with not having it in the first version that gets committed anddoing it as follow-up patch.


--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v19)

Reply via email to