On 06/28/2018 06:47 PM, Teodor Sigaev wrote: > Hi! > > ... > > - cost for i-th columns: > Ci * log2(NGi) => Ci * log2(N/G(i-1)) > So, here I believe, that i-th comparison function will be called only > inside > group which number is defined by (i-1) leading columns. Note, following > discussion [1] I add multiplier 1.5 here to be closer to worst case when > groups are significantly differ. Right now there is no way to > determine how > differ they could be. Note, this formula describes cost of first > column too > except difference with multiplier.
One more thought about estimating the worst case - I wonder if simply multiplying the per-tuple cost by 1.5 is the right approach. It does not seem particularly principled, and it's trivial simple to construct counter-examples defeating it (imagine columns with 99% of the rows having the same value, and then many values in exactly one row - that will defeat the ndistinct-based group size estimates). The reason why constructing such counter-examples is so simple is that this relies entirely on the ndistinct stats, ignoring the other stats we already have. I think this might leverage the per-column MCV lists (and eventually multi-column MCV lists, if that ever gets committed). We're estimating the number of tuples in group (after fixing values in the preceding columns), because that's what determines the number of comparisons we need to make at a given stage. The patch does this by estimating the average group size, by estimating ndistinct of the preceding columns G(i-1), and computing ntuples/G(i-1). But consider that we also have MCV lists for each column - if there is a very common value, it should be there. So let's say M(i) is a frequency of the most common value in i-th column, determined either from the MCV list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i) for the fist i columns, then the largest possible group of values when comparing values in i-th column is N * m(i-1) This seems like a better way to estimate the worst case. I don't know if this should be used instead of NG(i), or if those two estimates should be combined in some way. I think estimating the largest group we need to sort should be helpful for the incremental sort patch, so I'm adding Alexander to this thread. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services