On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <p...@2ndquadrant.com> wrote: > On 10/08/16 13:33, Tomas Vondra wrote: >> >> On 08/10/2016 06:41 AM, Michael Paquier wrote: >>> >>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra >>>> >>>> 2) combining multiple statistics >>>> >>>> >>>> I think the ability to combine multivariate statistics (covering >>>> different >>>> subsets of conditions) is important and useful, but I'm starting to >>>> think >>>> that the current implementation may not be the correct one (which is >>>> why I >>>> haven't written the SGML docs about this part of the patch series yet). >>>> >>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're >>>> estimating query: >>>> >>>> SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 >>>> >>>> but that we only have two statistics (a,b) and (b,c). The current >>>> patch does >>>> about this: >>>> >>>> P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) >>>> >>>> i.e. it estimates the first two conditions using (a,b), and then >>>> estimates >>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very >>>> efficient, >>>> but it only works as long as the query contains conditions >>>> "connecting" the >>>> two statistics. So if we remove the "b=2" condition from the query, this >>>> stops working. >>> >>> >>> This is trying to make the algorithm smarter than the user, which is >>> something I'd think we could live without. In this case statistics on >>> (a,c) or (a,b,c) are missing. And what if the user does not want to >>> make use of stats for (a,c) because he only defined (a,b) and (b,c)? >>> >> >> I don't think so. Obviously, if you have statistics covering all the >> conditions - great, we can't really do better than that. >> >> But there's a crucial relation between the number of dimensions of the >> statistics and accuracy of the statistics. Let's say you have statistics >> on 8 columns, and you split each dimension twice to build a histogram - >> that's 256 buckets right there, and we only get ~50% selectivity in each >> dimension (the actual histogram building algorithm is more complex, but >> you get the idea). > > I think it makes sense to pursue this, but I also think we can easily live > with not having it in the first version that gets committed and doing it as > follow-up patch.
This patch is large and complicated enough. As this is not a mandatory piece to get a basic support, I'd suggest just to drop that for later. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers