On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Jeff Janes <jeff.ja...@gmail.com> writes: > > On Dec 3, 2017 15:31, "Tom Lane" <t...@sss.pgh.pa.us> wrote: > >> Jeff Janes <jeff.ja...@gmail.com> writes: > >>> But I do see that ties within the logical order of the column values > are > >>> broken to agree with the physical order. That is wrong, right? Is > there > >>> any argument that this is desirable? > > >> Uh ... what do you propose doing instead? We'd have to do something > with > >> ties, and it's not so obvious this way is wrong. > > > Let them be tied. If there are 10 distinct values, number the values 0 > to > > 9, and all rows of a given distinct values get the same number for the > > logical order axis. > > Calling the correlation 0.8 when it is really 0.0 seems obviously wrong > to > > me. Although if we switched btree to store duplicate values with tid as > a > > tie breaker, then maybe it wouldn't be as obviously wrong. > > I thought some more about this. What we really want the correlation stat > to do is help us estimate how randomly an index-ordered scan will access > the heap. The correlation is used in another place, estimating how much of the table we will visit in the first place. If the correlation is very high, then scanning 10% of the index leaf pages means we will visit 10% of the table. If the correlation is low, then we use Mackert and Lohman, and (in the case of visiting 10% of the index) predict we will visit most of the table. Assuming effective_cache_size is high, we will visit most of the table just once, but still in a random order, because subsequent visits for the same query will be found in the cache. Rather than visiting the various pages repeatedly and not finding them in cache each time. In addition to estimating how much of the table we visit, we also estimate how "sequential like" those visits are. Which is the use that you describe. Ideally for that use case, we would know for each distinct value, how correlated the tids are with the leaf page ordering. If the index is freshly built, that is very high. We visit 1/10 of the index, which causes us to visit 100% of the table but in perfect order, plucking 1/10 of the tuples from each table page. But visiting 100% of the table in physical order in order to pluck out 10% of the tuples from each page is quite different than visiting 10% of the table pages in physical order to pluck out 100% of the tuples from those pages and 0% from the pages not visited. ... BTW, I disagree that "correlation = zero" is the right answer for this > particular example. If the btree is freshly built, then an index-order > scan would visit all the heap pages in sequence to fetch "f" rows, and > then visit them all in sequence again to fetch "t" rows, which is a whole > lot better than the purely random access that zero correlation implies. > So I think 0.8 or so is actually a perfectly reasonable answer when the > index is fresh. The trouble is just that it'd behoove us to derate that > answer somewhat for the probability that the index isn't fresh. > But, for the case of "how much of the table do we visit at all", correlation = zero is the right answer, even if it isn't the right answer for "how sequentially do we visit whatever we visit" > My first thought for a concrete fix was to use the mean position of > a group of duplicates for the purposes of the correlation calculation, > but on reflection that's clearly wrong. We do have an idea, from the > data we have, whether the duplicates are close together in the heap > or spread all over. Using only mean position would fail to distinguish > those cases, but really we'd better penalize the spread-all-over case. > I'm not sure how to do that. > Departing from correlations, we could also try to estimate "How many different table pages does each index leaf page reference". This could capture functional dependencies which are strong, but not in the form of linear correlations. (The current extended statistics only captures dependencies between user columns, not between one user column and one system column such as table slot) For whatever its worth, here is my "let ties be ties" patch. It breaks two regression tests due to plan changes, and both are cases where maybe the plan ought to change for the very reason being discussed. If I just put random gibberish into the correlation field, more regression tests fail, so I think my implementation is not too far broken. The accumulations into corr_ysum and corr_y2sum could trivially be pushed down into the "if", and corr_xysum could as well with a little algebra. But that seems like premature optimization for a proof-of-concept patch. Cheers, Jeff

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index f952b3c..3369784 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -2309,6 +2309,8 @@ compute_scalar_stats(VacAttrStatsP stats, bool is_varwidth = (!stats->attrtype->typbyval && stats->attrtype->typlen < 0); double corr_xysum; + double corr_ysum; + double corr_y2sum; SortSupportData ssup; ScalarItem *values; int values_cnt = 0; @@ -2429,6 +2431,8 @@ compute_scalar_stats(VacAttrStatsP stats, * be ordered by tupno). */ corr_xysum = 0; + corr_ysum = 0; + corr_y2sum = 0; ndistinct = 0; nmultiple = 0; dups_cnt = 0; @@ -2436,7 +2440,9 @@ compute_scalar_stats(VacAttrStatsP stats, { int tupno = values[i].tupno; - corr_xysum += ((double) i) * ((double) tupno); + corr_xysum += ((double) ndistinct) * ((double) tupno); + corr_ysum += ((double) ndistinct); + corr_y2sum += ((double) ndistinct) * ((double) ndistinct); dups_cnt++; if (tupnoLink[tupno] == tupno) { @@ -2777,11 +2783,11 @@ compute_scalar_stats(VacAttrStatsP stats, MemoryContextSwitchTo(old_context); /*---------- - * Since we know the x and y value sets are both + * Since we know the x value sets are * 0, 1, ..., values_cnt-1 - * we have sum(x) = sum(y) = + * we have sum(x) = * (values_cnt-1)*values_cnt / 2 - * and sum(x^2) = sum(y^2) = + * and sum(x^2) = * (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6. *---------- */ @@ -2791,8 +2797,8 @@ compute_scalar_stats(VacAttrStatsP stats, ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0; /* And the correlation coefficient reduces to */ - corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) / - (values_cnt * corr_x2sum - corr_xsum * corr_xsum); + corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_ysum) / + sqrt((values_cnt * corr_x2sum - corr_xsum) * (values_cnt * corr_y2sum - corr_ysum)); stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION; stats->staop[slot_idx] = mystats->ltopr;