On Mon, Jan 25, 2016 at 5:11 PM, Shulgin, Oleksandr
<oleksandr.shul...@zalando.de> wrote:
>
> On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>>
>> Overall, I think this is really about deciding when to cut off the MCV
>> list, so that it does not grow needlessly large - as Robert pointed out,
>> the larger the list, the more expensive the estimation (and thus
>> planning).
>>
>> So even if we could fit the whole sample into the MCV list (i.e. we
>> believe we've seen all the values and we can fit them into the MCV
>> list), it may not make sense to do so. The ultimate goal is to estimate
>> conditions, and if we can do that reasonably even after cutting off the
>> least frequent values from the MCV list, then why not?
>>
>> From this point of view, the analysis deals just with the ANALYZE part
>> and does not discuss the estimation counter-part at all.
>
> True, this aspect still needs verification. As stated, my primary
> motivation was to improve plan stability for relatively short MCV lists.
>
> Longer MCV lists might be a different story, but see the "Increasing
> stats target" section of the original mail: increasing the target
> doesn't give quite the expected results with unpatched code either.
To address this concern I've run my queries again on the same dataset, now
focusing on how the number of MCV items changes with the patched code
(using the CTEs from my original mail):

WITH ...
SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real
  FROM stats2
 WHERE num_mcv IS NOT NULL;

(ORIGINAL)
count  | 27452
min    | 1
avg    | 32.7115
max    | 100
stddev | 40.6927

(PATCHED)
count  | 27527
min    | 1
avg    | 38.4341
max    | 100
stddev | 43.3596

A significant portion of the MCV lists occupies all 100 slots available
with the default statistics target, so it is also interesting to look at
the stats that have "underfilled" MCV lists (by changing the condition of
the WHERE clause to read "num_mcv < 100"):

(<100 ORIGINAL)
count  | 20980
min    | 1
avg    | 11.9541
max    | 99
stddev | 18.4132

(<100 PATCHED)
count  | 19329
min    | 1
avg    | 12.3222
max    | 99
stddev | 19.6959

As one can see, with the patched code the average length of the MCV lists
doesn't change all that dramatically, while at the same time showing all
the improvements described in the original mail.

>> After fixing the estimator to consider the fraction of NULLs, the
>> estimates look like this:
>>
>> statistics target | master | patched
>> ------------------------------------------
>>  100              |  1302  |  5356
>>  1000             |  6022  |  6791
>>
>> So this seems to significantly improve the ndistinct estimate (patch
>> attached).
>
> Hm... this looks correct. And compute_distinct_stats() needs the same
> treatment, obviously.

I've incorporated this fix into v2 of my patch, I think it is related
closely enough. Also, added corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.

I'm adding this to the next CommitFest. Further reviews are very much
appreciated!

--
Alex
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..cbf3538 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
***************
*** 2079,2095 ****
  					denom,
  					stadistinct;
  
! 			numer = (double) samplerows *(double) d;
! 			denom = (double) (samplerows - f1) +
! 				(double) f1 *(double) samplerows / totalrows;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows)
! 				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
--- 2079,2099 ----
  					denom,
  					stadistinct;
  
! 			double		samplerows_nonnull = samplerows - null_cnt;
! 			double		totalrows_nonnull
! 				= totalrows * (1.0 - stats->stanullfrac);
! 			numer = samplerows_nonnull * (double) d;
! 
! 			denom = (samplerows_nonnull - f1) +
! 				(double) f1 * samplerows_nonnull / totalrows_nonnull;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows_nonnull)
! 				stadistinct = totalrows_nonnull;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
***************
*** 2120,2146 ****
  		}
  		else
  		{
  			double		ndistinct = stats->stadistinct;
- 			double		avgcount,
- 						mincount;
  
  			if (ndistinct < 0)
! 				ndistinct = -ndistinct * totalrows;
! 			/* estimate # of occurrences in sample of a typical value */
! 			avgcount = (double) samplerows / ndistinct;
! 			/* set minimum threshold count to store a value */
! 			mincount = avgcount * 1.25;
! 			if (mincount < 2)
! 				mincount = 2;
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
  				if (track[i].count < mincount)
  				{
  					num_mcv = i;
  					break;
  				}
  			}
  		}
--- 2124,2160 ----
  		}
  		else
  		{
+ 			/*
+ 			 * Starting number of values: samplerows sans nulls and too wide
+ 			 * ones.
+ 			 */
+ 			int			sample_cnt = nonnull_cnt - toowide_cnt;
  			double		ndistinct = stats->stadistinct;
  
  			if (ndistinct < 0)
! 				ndistinct = -ndistinct * sample_cnt;
! 
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
+ 				double		avgcount,
+ 							mincount;
+ 
+ 				/* estimate # of occurrences in sample of a typical value */
+ 				avgcount = (double) sample_cnt / ndistinct;
+ 
+ 				/* set minimum threshold count to store a value */
+ 				mincount = 1.25 * avgcount;
  				if (track[i].count < mincount)
  				{
  					num_mcv = i;
  					break;
  				}
+ 
+ 				/* Narrow our view of samples and distincts left */
+ 				sample_cnt -= track[i].count;
+ 				ndistinct--;
  			}
  		}
***************
*** 2428,2444 ****
  					denom,
  					stadistinct;
  
! 			numer = (double) samplerows *(double) d;
! 			denom = (double) (samplerows - f1) +
! 				(double) f1 *(double) samplerows / totalrows;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows)
! 				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
--- 2442,2462 ----
  					denom,
  					stadistinct;
  
! 			double		samplerows_nonnull = samplerows - null_cnt;
! 			double		totalrows_nonnull
! 				= totalrows * (1.0 - stats->stanullfrac);
! 
! 			numer = samplerows_nonnull * (double) d;
! 			denom = (samplerows_nonnull - f1) +
! 				(double) f1 * samplerows_nonnull / totalrows_nonnull;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows_nonnull)
! 				stadistinct = totalrows_nonnull;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
***************
*** 2464,2469 ****
--- 2482,2489 ----
  	 * emit duplicate histogram bin boundaries.  (We might end up with
  	 * duplicate histogram entries anyway, if the distribution is skewed;
  	 * but we prefer to treat such values as MCVs if at all possible.)
+ 	 * We also decrease ndistinct in the process such that going forward
+ 	 * it refers to the number of distinct values left for the histogram.
  	 */
  	if (track_cnt == ndistinct && toowide_cnt == 0 &&
  		stats->stadistinct > 0 &&
***************
*** 2471,2505 ****
  	{
  		/* Track list includes all values seen, and all will fit */
  		num_mcv = track_cnt;
  	}
  	else
  	{
! 		double		ndistinct = stats->stadistinct;
! 		double		avgcount,
! 					mincount,
! 					maxmincount;
  
- 		if (ndistinct < 0)
- 			ndistinct = -ndistinct * totalrows;
- 		/* estimate # of occurrences in sample of a typical value */
- 		avgcount = (double) samplerows / ndistinct;
- 		/* set minimum threshold count to store a value */
- 		mincount = avgcount * 1.25;
- 		if (mincount < 2)
- 			mincount = 2;
- 		/* don't let threshold exceed 1/K, however */
- 		maxmincount = (double) samplerows / (double) num_bins;
- 		if (mincount > maxmincount)
- 			mincount = maxmincount;
  		if (num_mcv > track_cnt)
  			num_mcv = track_cnt;
  		for (i = 0; i < num_mcv; i++)
  		{
! 			if (track[i].count < mincount)
  			{
! 				num_mcv = i;
! 				break;
  			}
  		}
  	}
--- 2491,2548 ----
  	{
  		/* Track list includes all values seen, and all will fit */
  		num_mcv = track_cnt;
+ 
+ 		/* Nothing left for the histogram */
+ 		num_hist = 0;
+ 		ndistinct = 0;
  	}
  	else
  	{
! 		/*
! 		 * Starting number of values left for the histogram: samplerows
! 		 * sans nulls and too wide ones.
! 		 */
! 		int			sample_cnt = values_cnt;
! 
! 		num_hist = ndistinct;
! 		if (num_hist > num_bins)
! 			num_hist = num_bins + 1;
  
  		if (num_mcv > track_cnt)
  			num_mcv = track_cnt;
  		for (i = 0; i < num_mcv; i++)
  		{
! 			if (num_hist >= 2)
  			{
! 				double		avgcount,
! 							mincount,
! 							maxmincount;
! 
! 				/* estimate # of occurrences in sample of a typical value */
! 				avgcount = (double) sample_cnt / (double) ndistinct;
! 
! 				/* set minimum threshold count to store a value */
! 				mincount = 1.25 * avgcount;
! 
! 				/* don't let threshold exceed 1/K, however */
! 				maxmincount = (sample_cnt - 1) / (double) (num_hist - 1);
! 				if (mincount > maxmincount)
! 					mincount = maxmincount;
! 				if (track[i].count < mincount)
! 				{
! 					num_mcv = i;
! 					break;
! 				}
  			}
+ 
+ 			/* Narrow our view of samples left for the histogram */
+ 			sample_cnt -= track[i].count;
+ 			ndistinct--;
+ 
+ 			/* Recalculate histogram size due to lower ndistinct */
+ 			num_hist = ndistinct;
+ 			if (num_hist > num_bins)
+ 				num_hist = num_bins + 1;
  		}
  	}
***************
*** 2542,2550 ****
  	 * values not accounted for in the MCV list.  (This ensures the
  	 * histogram won't collapse to empty or a singleton.)
  	 */
- 	num_hist = ndistinct - num_mcv;
- 	if (num_hist > num_bins)
- 		num_hist = num_bins + 1;
  	if (num_hist >= 2)
  	{
  		MemoryContext old_context;
--- 2585,2590 ----
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers