Thank you for finishing this. At Mon, 20 Mar 2017 16:02:20 -0300, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote in <20170320190220.ixlaueanxegqd5gr@alvherre.pgsql> > Here is a closer to final version of the multivariate statistics series, > last posted at > https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql
I'm sorry but this seems conflicting the current master(17fa3e8) and a3eac988c267, which is the base of v28. > If you've always wanted to review multivariate stats, but never found a > good reason to, now is a terrific time to do so! (In other words: I > plan to get this pushed in the not too distant future.) Great! But sorry for having contributed not so much. > This is a new thread to present a version of the n-distinct patch that > IMO is close enough to commit. There are some work items still. > There's some discussion on the topic of cross-column statistics: > https://wiki.postgresql.org/wiki/Cross_Columns_Stats > > This problem is important enough that Kyotaro Horiguchi submitted > another patch that does the same thing: > https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp > This patch aims to provide the same functionality, keeping the design > general enough that other kinds of statistics can be added later (such > as functional dependencies, histograms and MCVs, all of which have been > previously submitted as patches by Tomas). I may be stupid but I don't get the picture here, specifically about the relation to Tomas's patch. Does this work as infrastructure for Tomas's mv patch? Or in some other relationsip? > To recap, what this patch provides is a new command of the form > CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table > > Note that we put the table name in a separate FROM clause instead of > together with the column name, so that this is more readily extensible > to things that are not just columns, for example expressions that might > involve more than one table (per review from Dean Rasheed). Currently, > only one table is supported. > > In this patch, the "opts" can only be "ndistinct", which creates a > pg_statistic_ext row with the number of distinct groups found in all > possible combination across that set of columns. This can be used when > a GROUP BY or a DISTINCT clause need to estimate the number of distinct > groups in an aggregation. > > > > Some things left to change: > > * Currently, we use the ndistinct value only if the grouping uses > exactly the set of columns covered by a statistics. For example, if we > have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to > the old method, which may result in worse results than if we used the > number we know about (a,b,c) then applied a fixup to consider the > distinctness of (d). Do you planning to realize correcting esitimation of joins perplexed by strong correlations? > * Also, estimate_num_groups() looks a bit patchy. With slightly more > invasive changes we can make it look more natural. > > * I'm not terribly happy with the header organization. I think > VacAttrStats should be in its own (new) src/include/statistics/analyze.h > for example (which cleans up a bunch of existing stuff a bit), and the > new files could do with some slight makeover. > > * The current code uses AttrNumber * and int2vector, in places where it > would be more convenient to use Bitmapsets. > > * We currently try to keep a stats object even if a column in it is > dropped -- for example, if we have stats on (a,b,c) and drop (b), then > we still have stats on (a,c). While this is nice, it creates a bunch of > weird corner cases, so I'm going to rip that out and just drop the > statistics instead. If the user wants stats on (a,c) to remain, they > can create it after (or before) dropping the column. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers