On Mon, Jan 25, 2016 at 5:11 PM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:
>
> On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <
tomas.von...@2ndquadrant.com> wrote:
>>
>>
>> Overall, I think this is really about deciding when to cut-off the MCV,
so that it does not grow needlessly large - as Robert pointed out, the
larger the list, the more expensive the estimation (and thus planning).
>>
>> So even if we could fit the whole sample into the MCV list (i.e. we
believe we've seen all the values and we can fit them into the MCV list),
it may not make sense to do so. The ultimate goal is to estimate
conditions, and if we can do that reasonably even after cutting off the
least frequent values from the MCV list, then why not?
>>
>> From this point of view, the analysis deals just with the
ANALYZE part and does not discuss the estimation counter-part at all.
>
>
> True, this aspect still needs verification.  As stated, my primary
motivation was to improve the plan stability for relatively short MCV lists.
>
> Longer MCV lists might be a different story, but see "Increasing stats
target" section of the original mail: increasing the target doesn't give
quite the expected results with unpatched code either.

To address this concern I've run my queries again on the same dataset, now
focusing on how the number of MCV items changes with the patched code
(using the CTEs from my original mail):

WITH ...

SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real

  FROM stats2

 WHERE num_mcv IS NOT NULL;

(ORIGINAL)
count  | 27452
min    | 1
avg    | 32.7115
max    | 100
stddev | 40.6927

(PATCHED)
count  | 27527
min    | 1
avg    | 38.4341
max    | 100
stddev | 43.3596

A significant portion of the MCV lists occupies all 100 slots available
with the default statistics target, so it is also interesting to look at the
stats that have "underfilled" MCV lists (by changing the condition of the
WHERE clause to read "num_mcv < 100"):

(<100 ORIGINAL)
count  | 20980
min    | 1
avg    | 11.9541
max    | 99
stddev | 18.4132

(<100 PATCHED)
count  | 19329
min    | 1
avg    | 12.3222
max    | 99
stddev | 19.6959

As one can see, with the patched code the average length of the MCV lists
doesn't change all that dramatically, while the patch still delivers all
the improvements described in the original mail.

>> After fixing the estimator to consider fraction of NULLs, the estimates
look like this:
>>
>>     statistics target |   master  |  patched
>>    ------------------------------------------
>>                   100 |     1302  |     5356
>>                  1000 |     6022  |     6791
>>
>> So this seems to significantly improve the ndistinct estimate (patch
attached).
>
>
> Hm... this looks correct.  And compute_distinct_stats() needs the same
treatment, obviously.

I've incorporated this fix into the v2 of my patch, since I think it is
closely enough related.  I've also added the corresponding changes to
compute_distinct_stats(), which doesn't produce a histogram.

I'm adding this to the next CommitFest.  Further reviews are very much
appreciated!

--
Alex
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..cbf3538 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
***************
*** 2079,2095 ****
  						denom,
  						stadistinct;
  
! 			numer = (double) samplerows *(double) d;
  
! 			denom = (double) (samplerows - f1) +
! 				(double) f1 *(double) samplerows / totalrows;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows)
! 				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
  
--- 2079,2099 ----
  						denom,
  						stadistinct;
  
! 			double		samplerows_nonnull = samplerows - null_cnt;
! 			double		totalrows_nonnull
! 							= totalrows * (1.0 - stats->stanullfrac);
  
! 			numer = samplerows_nonnull * (double) d;
! 
! 			denom = (samplerows_nonnull - f1) +
! 				(double) f1 * samplerows_nonnull / totalrows_nonnull;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows_nonnull)
! 				stadistinct = totalrows_nonnull;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
  
***************
*** 2120,2146 ****
  		}
  		else
  		{
  			double		ndistinct = stats->stadistinct;
- 			double		avgcount,
- 						mincount;
  
  			if (ndistinct < 0)
! 				ndistinct = -ndistinct * totalrows;
! 			/* estimate # of occurrences in sample of a typical value */
! 			avgcount = (double) samplerows / ndistinct;
! 			/* set minimum threshold count to store a value */
! 			mincount = avgcount * 1.25;
! 			if (mincount < 2)
! 				mincount = 2;
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
  				if (track[i].count < mincount)
  				{
  					num_mcv = i;
  					break;
  				}
  			}
  		}
  
--- 2124,2160 ----
  		}
  		else
  		{
+ 			/*
+ 			 * Starting number of values: samplerows sans nulls and too wide
+ 			 * ones.
+ 			 */
+ 			int			sample_cnt = nonnull_cnt - toowide_cnt;
  			double		ndistinct = stats->stadistinct;
  
  			if (ndistinct < 0)
! 				ndistinct = -ndistinct * sample_cnt;
! 
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
+ 				double		avgcount,
+ 							mincount;
+ 
+ 				/* estimate # of occurrences in sample of a typical value */
+ 				avgcount = (double) sample_cnt / ndistinct;
+ 
+ 				/* set minimum threshold count to store a value */
+ 				mincount = 1.25 * avgcount;
  				if (track[i].count < mincount)
  				{
  					num_mcv = i;
  					break;
  				}
+ 
+ 				/* Narrow our view of samples and distincts left */
+ 				sample_cnt -= track[i].count;
+ 				ndistinct--;
  			}
  		}
  
***************
*** 2428,2444 ****
  						denom,
  						stadistinct;
  
! 			numer = (double) samplerows *(double) d;
  
! 			denom = (double) (samplerows - f1) +
! 				(double) f1 *(double) samplerows / totalrows;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows)
! 				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
  
--- 2442,2462 ----
  						denom,
  						stadistinct;
  
! 			double		samplerows_nonnull = samplerows - null_cnt;
! 			double		totalrows_nonnull
! 							= totalrows * (1.0 - stats->stanullfrac);
! 
! 			numer = samplerows_nonnull * (double) d;
  
! 			denom = (samplerows_nonnull - f1) +
! 				(double) f1 * samplerows_nonnull / totalrows_nonnull;
  
  			stadistinct = numer / denom;
  			/* Clamp to sane range in case of roundoff error */
  			if (stadistinct < (double) d)
  				stadistinct = (double) d;
! 			if (stadistinct > totalrows_nonnull)
! 				stadistinct = totalrows_nonnull;
  			stats->stadistinct = floor(stadistinct + 0.5);
  		}
  
***************
*** 2464,2469 ****
--- 2482,2489 ----
  		 * emit duplicate histogram bin boundaries.  (We might end up with
  		 * duplicate histogram entries anyway, if the distribution is skewed;
  		 * but we prefer to treat such values as MCVs if at all possible.)
+ 		 * We also decrease ndistinct in the process such that going forward
+ 		 * it refers to the number of distinct values left for the histogram.
  		 */
  		if (track_cnt == ndistinct && toowide_cnt == 0 &&
  			stats->stadistinct > 0 &&
***************
*** 2471,2505 ****
  		{
  			/* Track list includes all values seen, and all will fit */
  			num_mcv = track_cnt;
  		}
  		else
  		{
! 			double		ndistinct = stats->stadistinct;
! 			double		avgcount,
! 						mincount,
! 						maxmincount;
  
- 			if (ndistinct < 0)
- 				ndistinct = -ndistinct * totalrows;
- 			/* estimate # of occurrences in sample of a typical value */
- 			avgcount = (double) samplerows / ndistinct;
- 			/* set minimum threshold count to store a value */
- 			mincount = avgcount * 1.25;
- 			if (mincount < 2)
- 				mincount = 2;
- 			/* don't let threshold exceed 1/K, however */
- 			maxmincount = (double) samplerows / (double) num_bins;
- 			if (mincount > maxmincount)
- 				mincount = maxmincount;
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
! 				if (track[i].count < mincount)
  				{
! 					num_mcv = i;
! 					break;
  				}
  			}
  		}
  
--- 2491,2548 ----
  		{
  			/* Track list includes all values seen, and all will fit */
  			num_mcv = track_cnt;
+ 
+ 			/* Nothing left for the histogram */
+ 			num_hist = 0;
+ 			ndistinct = 0;
  		}
  		else
  		{
! 			/*
! 			 * Starting number of values left for the histogram: samplerows
! 			 * sans nulls and too wide ones.
! 			 */
! 			int			sample_cnt = values_cnt;
! 
! 			num_hist = ndistinct;
! 			if (num_hist > num_bins)
! 				num_hist = num_bins + 1;
  
  			if (num_mcv > track_cnt)
  				num_mcv = track_cnt;
  			for (i = 0; i < num_mcv; i++)
  			{
! 				if (num_hist >= 2)
  				{
! 					double		avgcount,
! 								mincount,
! 								maxmincount;
! 
! 					/* estimate # of occurrences in sample of a typical value */
! 					avgcount = (double) sample_cnt / (double) ndistinct;
! 
! 					/* set minimum threshold count to store a value */
! 					mincount = 1.25 * avgcount;
! 
! 					/* don't let threshold exceed 1/K, however */
! 					maxmincount = (sample_cnt - 1) / (double) (num_hist - 1);
! 					if (mincount > maxmincount)
! 						mincount = maxmincount;
! 					if (track[i].count < mincount)
! 					{
! 						num_mcv = i;
! 						break;
! 					}
  				}
+ 
+ 				/* Narrow our view of samples left for the histogram */
+ 				sample_cnt -= track[i].count;
+ 				ndistinct--;
+ 
+ 				/* Recalculate histogram size due to lower ndistinct */
+ 				num_hist = ndistinct;
+ 				if (num_hist > num_bins)
+ 					num_hist = num_bins + 1;
  			}
  		}
  
***************
*** 2542,2550 ****
  		 * values not accounted for in the MCV list.  (This ensures the
  		 * histogram won't collapse to empty or a singleton.)
  		 */
- 		num_hist = ndistinct - num_mcv;
- 		if (num_hist > num_bins)
- 			num_hist = num_bins + 1;
  		if (num_hist >= 2)
  		{
  			MemoryContext old_context;
--- 2585,2590 ----
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers