"Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes: > On Apr 1, 2016 23:14, "Tom Lane" <t...@sss.pgh.pa.us> wrote: >> Haven't looked at 0002 yet.

## Advertising

> [crosses fingers] hope you'll have a chance to do that before feature > freeze for 9.6 I studied this patch for awhile after rebasing it onto yesterday's commits. I did not like the fact that the compute_scalar_stats logic would allow absolutely anything into the MCV list once num_hist falls below 2. I think it's important that we continue to reject values that are only seen once in the sample, because there's no very good reason to think that they are MCVs and not just infrequent values that by luck appeared in the sample. However, after I rearranged the tests there so that "if (num_hist >= 2)" only controlled whether to apply the 1/K limit, one of the regression tests started to fail: there's a place in rowsecurity.sql that expects that if a column contains nothing but several instances of a single value, that value will be recorded as a lone MCV. Now this isn't a particularly essential thing for that test, but it still seems like a good property for ANALYZE to have. The reason it's failing, of course, is that the test as written cannot possibly accept the last (or only) value. Before I noticed the regression failure, I'd been thinking that maybe it'd be better if the decision rule were not "at least 100+x% of the average frequency of this value and later ones", but "at least 100+x% of the average frequency of values after this one". With that formulation, we're not constrained as to the range of x. Now, if there are *no* values after this one, then this way needs an explicit special case in order not to compute 0/0; but the preceding para shows that we need a special case for the last value anyway. So, attached is a patch rewritten along these lines. I used 50% rather than 25% as the new cutoff percentage --- obviously it should be higher in this formulation than before, but I have no idea if that particular number is good or we should use something else. Also, the rule for the last value is "at least 1% of the non-null samples". That's a pure guess as well. I do not have any good corpuses of data to try this on. Can folks who have been following this thread try it on their data and see how it does? Also please try some other multipliers besides 1.5, so we can get a feeling for where that cutoff should be placed. regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 44a4b3f..a2c606b 100644 *** a/src/backend/commands/analyze.c --- b/src/backend/commands/analyze.c *************** compute_distinct_stats(VacAttrStatsP sta *** 2120,2128 **** * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), * then do so. Otherwise, store only those values that are ! * significantly more common than the (estimated) average. We set the ! * threshold rather arbitrarily at 25% more than average, with at ! * least 2 instances in the sample. */ if (track_cnt < track_max && toowide_cnt == 0 && stats->stadistinct > 0 && --- 2120,2138 ---- * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), * then do so. Otherwise, store only those values that are ! * significantly more common than the ones we omit. We determine that ! * by considering the values in frequency order, and accepting each ! * one if it is at least 50% more common than the average among the ! * values after it. The 50% threshold is somewhat arbitrary. ! * ! * Note that the 50% rule will never accept a value with count 1, ! * since all the values have count at least 1; this is a property we ! * desire, since there's no very good reason to assume that a ! * single-occurrence value is an MCV and not just a random non-MCV. ! * ! * We need a special rule for the very last value. If we get to it, ! * we'll accept it if it's at least 1% of the non-null samples and has ! * count more than 1. */ if (track_cnt < track_max && toowide_cnt == 0 && stats->stadistinct > 0 && *************** compute_distinct_stats(VacAttrStatsP sta *** 2133,2153 **** } else { ! /* d here is the same as d in the Haas-Stokes formula */ int d = nonnull_cnt - summultiple + nmultiple; ! double avgcount, ! mincount; ! /* estimate # occurrences in sample of a typical nonnull value */ ! avgcount = (double) nonnull_cnt / (double) d; ! /* set minimum threshold count to store a value */ ! mincount = avgcount * 1.25; ! if (mincount < 2) ! mincount = 2; if (num_mcv > track_cnt) num_mcv = track_cnt; for (i = 0; i < num_mcv; i++) { if (track[i].count < mincount) { num_mcv = i; --- 2143,2181 ---- } else { ! /* d here is initially the same as d in the Haas-Stokes formula */ int d = nonnull_cnt - summultiple + nmultiple; ! int remaining_samples = nonnull_cnt; ! /* can't store more MCVs than we tracked ... */ if (num_mcv > track_cnt) num_mcv = track_cnt; + /* locate first value we're not going to store as an MCV */ for (i = 0; i < num_mcv; i++) { + double avgcount, + mincount; + + /* remove current value from remaining_samples and d */ + remaining_samples -= track[i].count; + d--; + if (d > 0) + { + /* get avg # occurrences of distinct values after this */ + avgcount = (double) remaining_samples / (double) d; + /* set minimum count to accept a value (is surely > 1) */ + mincount = avgcount * 1.50; + } + else + { + /* last value, use 1% rule */ + mincount = nonnull_cnt * 0.01; + + /* here, we need a clamp to avoid accepting count 1 */ + if (mincount < 2) + mincount = 2; + } + /* if this value falls below threshold, we're done */ if (track[i].count < mincount) { num_mcv = i; *************** compute_scalar_stats(VacAttrStatsP stats *** 2375,2381 **** /* * Found a new item for the mcv list; find its * position, bubbling down old items if needed. Loop ! * invariant is that j points at an empty/ replaceable * slot. */ int j; --- 2403,2409 ---- /* * Found a new item for the mcv list; find its * position, bubbling down old items if needed. Loop ! * invariant is that j points at an empty/replaceable * slot. */ int j; *************** compute_scalar_stats(VacAttrStatsP stats *** 2475,2488 **** * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), * then do so. Otherwise, store only those values that are ! * significantly more common than the (estimated) average. We set the ! * threshold rather arbitrarily at 25% more than average, with at ! * least 2 instances in the sample. Also, we won't suppress values ! * that have a frequency of at least 1/K where K is the intended ! * number of histogram bins; such values might otherwise cause us to ! * emit duplicate histogram bin boundaries. (We might end up with ! * duplicate histogram entries anyway, if the distribution is skewed; ! * but we prefer to treat such values as MCVs if at all possible.) */ if (track_cnt == ndistinct && toowide_cnt == 0 && stats->stadistinct > 0 && --- 2503,2532 ---- * we are able to generate a complete MCV list (all the values in the * sample will fit, and we think these are all the ones in the table), * then do so. Otherwise, store only those values that are ! * significantly more common than the ones we omit. We determine that ! * by considering the values in frequency order, and accepting each ! * one if it is at least 50% more common than the average among the ! * values after it. The 50% threshold is somewhat arbitrary. ! * ! * We need a special rule for the very last value. If we get to it, ! * we'll accept it if it's at least 1% of the non-null samples and has ! * count more than 1. ! * ! * Also, we will treat values as MCVs if they have a frequency of at ! * least 1/K where K is the intended number of histogram entries. ! * Relegating such values to the histogram might cause us to emit ! * duplicate histogram entries. (We might get duplicate histogram ! * entries anyway, if the distribution is skewed; but we prefer to ! * treat such values as MCVs if at all possible.) For this purpose, ! * measure the frequency with respect to the population represented by ! * the histogram; so in the loop below, cur_remaining_samples/num_hist ! * is the correct calculation. ! * ! * In any case, unless we believe we have a complete MCV list, we will ! * not accept an MCV value with count 1, since there's no very good ! * reason to assume that a single-occurrence value is an MCV and not ! * just a random non-MCV. This is automatic with the 50% rule but ! * needs enforcement with the other ones. */ if (track_cnt == ndistinct && toowide_cnt == 0 && stats->stadistinct > 0 && *************** compute_scalar_stats(VacAttrStatsP stats *** 2490,2523 **** { /* Track list includes all values seen, and all will fit */ num_mcv = track_cnt; } else { ! /* d here is the same as d in the Haas-Stokes formula */ int d = ndistinct + toowide_cnt; ! double avgcount, ! mincount, ! maxmincount; ! /* estimate # occurrences in sample of a typical nonnull value */ ! avgcount = (double) values_cnt / (double) d; ! /* set minimum threshold count to store a value */ ! mincount = avgcount * 1.25; ! if (mincount < 2) ! mincount = 2; ! /* don't let threshold exceed 1/K, however */ ! maxmincount = (double) values_cnt / (double) num_bins; ! if (mincount > maxmincount) ! mincount = maxmincount; if (num_mcv > track_cnt) num_mcv = track_cnt; for (i = 0; i < num_mcv; i++) { if (track[i].count < mincount) { num_mcv = i; break; } } } --- 2534,2605 ---- { /* Track list includes all values seen, and all will fit */ num_mcv = track_cnt; + /* Nothing left for the histogram */ + num_hist = 0; } else { ! /* d here is initially the same as d in the Haas-Stokes formula */ int d = ndistinct + toowide_cnt; ! int remaining_samples = values_cnt; ! /* ! * num_hist is the planned histogram size; it's limited by the ! * number of distinct sample vals not absorbed into the MCV list. ! * Start with assumption that nothing is absorbed into MCV list. ! */ ! num_hist = Min(ndistinct, num_bins + 1); ! ! /* can't store more MCVs than we tracked ... */ if (num_mcv > track_cnt) num_mcv = track_cnt; + /* locate first value we're not going to store as an MCV */ for (i = 0; i < num_mcv; i++) { + int cur_remaining_samples = remaining_samples; + double avgcount, + mincount; + + /* remove current value from remaining_samples and d */ + remaining_samples -= track[i].count; + d--; + if (d > 0) + { + /* get avg # occurrences of distinct values after this */ + avgcount = (double) remaining_samples / (double) d; + /* set minimum count to accept a value (is surely > 1) */ + mincount = avgcount * 1.50; + } + else + { + /* last value, use 1% rule */ + mincount = values_cnt * 0.01; + + /* here, we need a clamp to avoid accepting count 1 */ + if (mincount < 2) + mincount = 2; + } + + /* don't let threshold exceed 1/K, however */ + if (num_hist >= 2) /* else we won't make a histogram */ + { + double hfreq; + + hfreq = (double) cur_remaining_samples / (double) num_hist; + if (mincount > hfreq) + mincount = hfreq; + /* hfreq could be 1, so clamp to avoid accepting count 1 */ + if (mincount < 2) + mincount = 2; + } + /* if this value falls below threshold, we're done */ if (track[i].count < mincount) { num_mcv = i; break; } + /* update planned histogram size, removing this value */ + num_hist = Min(ndistinct - (i + 1), num_bins + 1); } } *************** compute_scalar_stats(VacAttrStatsP stats *** 2560,2568 **** * values not accounted for in the MCV list. (This ensures the * histogram won't collapse to empty or a singleton.) */ - num_hist = ndistinct - num_mcv; - if (num_hist > num_bins) - num_hist = num_bins + 1; if (num_hist >= 2) { MemoryContext old_context; --- 2642,2647 ----

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers