Re: [HACKERS] multivariate statistics v14

Tatsuo Ishii Tue, 22 Mar 2016 05:47:45 -0700

> On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>>> PostgreSQL users, and you are not too confident about 0006 or
>>>> beyond. Then I would think it is a little bit hard to justify
>>>> putting 000[2-5] into 9.6. I really like this feature and would
>>>> like to see it in PostgreSQL someday, but I'm not sure if we should
>>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>>> know if there are reasons we should put the patches into
>>>> PostgreSQL now.
>>>
>>> I don't think so. While being able to combine multiple statistics
>>> is certainly useful, I'm convinced that the initial patches add
>>> enough
>>
>> Can you please elaborate a little bit more how combining multiple
>> statistics is useful?
> 
> Sure.
> 
> The goal of multivariate statistics is to approximate a probability
> distribution on a group of columns. The larger the number of columns,
> the less accurate the statistics will be (with respect to individual
> columns), assuming a fixed sample size in ANALYZE and a fixed
> statistics size.
> 
> For example, if you add a column to a multivariate histogram, you'll do
> some "bucket splits" along this dimension, thus reducing the accuracy for
> the other columns. You may of course allow larger statistics
> (e.g. histograms with more buckets), but that also requires larger
> samples, and so on.
> 
> Now, let's assume you have a query like this:
> 
>     WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
> 
> and that "a" and "b" are correlated, and "c" and "d" are correlated,
> but that otherwise the columns are independent. It'd be a bit silly to
> require building statistics on (a,b,c,d), when two statistics on each
> of the column pairs would be cheaper and also more accurate.
> 
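
A minimal sketch of the independent-groups case described above, in
Python. The toy table, the `selectivity` helper and all names are
hypothetical illustrations, not code from the patch; exact counts on the
table stand in for the statistics, and fractions are used so the numbers
can be checked by hand.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy table with columns (a, b, c, d): b is determined by a,
# d is determined by c, and the (a, b) group is independent of (c, d).
rows = [(a, a + 1, c, c + 1) for a, c in product(range(10), repeat=2)]
COLS = {"a": 0, "b": 1, "c": 2, "d": 3}

def selectivity(**conds):
    """Fraction of rows matching every column=value condition."""
    hits = sum(all(r[COLS[k]] == v for k, v in conds.items()) for r in rows)
    return Fraction(hits, len(rows))

# True selectivity of: WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)
actual = selectivity(a=1, b=2, c=3, d=4)

# Estimate from four per-column selectivities (assumes full independence).
per_column = (selectivity(a=1) * selectivity(b=2)
              * selectivity(c=3) * selectivity(d=4))

# Estimate from two pair statistics, multiplied because the groups
# (a, b) and (c, d) are independent of each other.
two_stats = selectivity(a=1, b=2) * selectivity(c=3, d=4)

print(actual, per_column, two_stats)
```

On this toy data the product of the two pair selectivities equals the
true selectivity (1/100), while multiplying the four per-column
selectivities gives 1/10000.
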
> That's of course a trivial case - independent groups of correlated
> columns. But I'd say this is actually a pretty common case, and I do
> believe there's not much controversy that we should support it.
> 
> Another reason to allow multiple statistics is that columns in one
> group may be a good fit for an MCV list (which works well for discrete
> values), while the other group may be a good candidate for a histogram
> (which works well for continuous values). This can't be solved by
> first building an MCV list and then a histogram on the group.
> 
> The question of course is what to do if the groups are not
> independent. The patch does that by assuming the statistics overlap,
> and uses conditions on the columns included in both statistics to
> combine them using conditional probabilities. I do believe this works
> quite well, but this is perhaps the part that needs further
> discussion. There are other ways to combine the statistics, but I do
> expect them to be considerably more expensive.
> 
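
One way to read "combine them using conditional probabilities" for two
statistics that share a column is P(a,b,c) = P(a,b) * P(c|b), i.e.
P(a,b) * P(b,c) / P(b). The Python sketch below illustrates only that
formula; it assumes a and c are conditionally independent given b, uses
exact counts on a hypothetical toy table in place of real statistics, and
is not the patch's implementation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy table with columns (a, b, c): for a given b, the
# columns a and c vary independently, so they are related only via b.
rows = [(2 * b + i, b, 3 * b + j)
        for b, i, j in product(range(5), range(2), range(3))]
COLS = {"a": 0, "b": 1, "c": 2}

def prob(**conds):
    """Fraction of rows matching every column=value condition."""
    hits = sum(all(r[COLS[k]] == v for k, v in conds.items()) for r in rows)
    return Fraction(hits, len(rows))

def combine(a, b, c):
    """Estimate P(a, b, c) as P(a, b) * P(b, c) / P(b)."""
    p_b = prob(b=b)
    if p_b == 0:
        return Fraction(0)  # no rows with this b, so nothing can match
    return prob(a=a, b=b) * prob(b=b, c=c) / p_b

actual = prob(a=2, b=1, c=4)                        # true selectivity
combined = combine(2, 1, 4)                         # (a,b) and (b,c) combined
independent = prob(a=2) * prob(b=1) * prob(c=4)     # per-column estimate

print(actual, combined, independent)
```

Here the combined estimate equals the true selectivity (1/30) because the
toy data satisfies the conditional-independence assumption exactly; the
per-column estimate is 1/750. On real data the combined value would only
be an approximation.
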
> Is this a sufficient explanation?
> 
> Of course, there's a fair amount of additional complexity that I have
> not mentioned here (e.g. selecting the right combination of stats).


Sorry, maybe I did not explain clearly. My question is: if we put
only patches 0002 to 0005 into 9.6, do they still give any visible
benefit to users?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
