On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:

> Anyway, since the proof is in the pudding, Simon and I will be working on 
> some demo code for different sampling methods so that we can debate 
> results rather than theory.

I enclose a patch for checking out block sampling. This is not
production-ready yet, but it works bug-free on cvstip. Code comments
have been fully updated to explain what's going on inside.

All you need to do is "set analyze_blocks=b" and ANALYZE will switch
over to the block sampling method and read all the rows in "b" blocks.
The sample size will also be limited by maintenance_work_mem. (The
memory limiting could be smarter.) This decouples the specification of
the sample size from the specification of the MCV/histogram size
(stats_target). An example session follows.
[Right now, I'm not suggesting that we keep a GUC with this name - it
exists only for testing. If/when we agree to allow block sampling, we
can discuss how to invoke/specify it]
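
For example (the table name and block count are just for illustration):

  SET analyze_blocks = 1000;      -- sample 1000 blocks
  ANALYZE bigtable;
  RESET analyze_blocks;           -- back to normal row sampling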

The stats calculations aren't touched - it still uses Haas-Stokes.
If you "set log_min_messages=DEBUG2" you'll also get more useful
explanations of what the variables are and what decisions it makes about
D for each column being analyzed.
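
For anyone wanting to cross-check the DEBUG2 output by hand, here is a
standalone sketch (not part of the patch) of the Duj1 computation as
analyze.c applies it, including the clamping:

  #include <math.h>

  /*
   * Haas-Stokes Duj1: D = n*d / (n - f1 + f1*n/N), clamped to [d, N],
   * where n = sample size, N = estimated total rows, f1 = # values
   * occurring exactly once in the sample, and d = # distinct values
   * in the sample.
   */
  static double
  duj1_ndistinct(int n, double N, int f1, int d)
  {
      double    numer = (double) n * (double) d;
      double    denom = (double) (n - f1) + (double) f1 * (double) n / N;
      double    D = numer / denom;

      if (D < (double) d)
          D = (double) d;
      if (D > N)
          D = N;
      return floor(D + 0.5);
  }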

This patch has two main effects:
- ANALYZE runs more than 10x faster to retrieve a sample of the same size
- you can specify much larger samples for bigger tables, without
increasing the stats targets

Generally, the larger samples will give better results for the
estimation. However, what is immediately apparent is that the
Haas-Stokes estimator actually gets even worse with block sampling in
the particular case I raised on-list, which is understandable but not
desirable. ISTM this is a strike against Haas-Stokes, rather than a
strike against block sampling. So I'm looking at implementing the
Brutlag & Richardson estimator(s), which cope with the number of values
that appear in only one block. Not surprisingly, that means some
non-trivial additional code to retrieve the blockid of each tuple and
make decisions accordingly. I plan to use a technique similar to the
existing TupnoLink array to match blockids; a rough sketch follows.
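
To make that concrete, here is a rough sketch of the kind of code I have
in mind (not part of the patch; blkno_of_tupno[] and datums_equal() are
hypothetical stand-ins). The blockid is already available from each
sampled tuple's TID via ItemPointerGetBlockNumber(), so for a sorted
column sample we can count how many values are confined to one block:

  /*
   * Sketch only. Assumes values[] is sorted by datum, as in
   * compute_scalar_stats(), and that blkno_of_tupno[] was filled in
   * while the sample was collected, e.g.
   *    blkno_of_tupno[numrows] =
   *        ItemPointerGetBlockNumber(&targtuple.t_self);
   * datums_equal() stands in for the column's "=" operator.
   */
  static int
  count_single_block_values(ScalarItem *values, int nvalues,
                            BlockNumber *blkno_of_tupno)
  {
      int         nsingle = 0;
      int         i = 0;

      while (i < nvalues)
      {
          BlockNumber blk = blkno_of_tupno[values[i].tupno];
          bool        one_block = true;
          int         j;

          /* walk the run of equal datums */
          for (j = i + 1; j < nvalues &&
               datums_equal(values[i].value, values[j].value); j++)
          {
              if (blkno_of_tupno[values[j].tupno] != blk)
                  one_block = false;
          }
          if (one_block)
              nsingle++;
          i = j;          /* skip past this run of equal values */
      }
      return nsingle;
  }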

The B&R estimators should allow a fairly fast, small sample to be taken,
making this more usable for dynamic sampling during query planning (when
desirable, see recent -perform thread).

It's also worth mentioning that for datatypes that have only an "="
operator, the performance of compute_minimal_stats is O(N^2) when
values are unique, so increasing the sample size is a very bad idea in
that case. It may be possible to re-sample the sample, so that we get
only one row per block, as with the current row sampling method (see
the sketch below). Another idea might be to just abort the analysis
when the column looks fairly unique, rather than churn through the
whole sample.
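
As a sketch of the re-sampling idea (not in the patch; it assumes rows[]
is in physical order, as acquire_sample_rows returns it, and it ignores
freeing the dropped tuples):

  static int
  thin_to_one_row_per_block(HeapTuple *rows, int numrows)
  {
      int         nkept = 0;
      BlockNumber lastblk = InvalidBlockNumber;
      int         i;

      for (i = 0; i < numrows; i++)
      {
          BlockNumber blk = ItemPointerGetBlockNumber(&rows[i]->t_self);

          /* keep only the first row seen in each block; choosing a
           * random row per block would be less biased */
          if (blk != lastblk)
          {
              rows[nkept++] = rows[i];
              lastblk = blk;
          }
      }
      return nkept;       /* the reduced sample size */
  }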

Best Regards, Simon Riggs

Index: src/backend/commands/analyze.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/analyze.c,v
retrieving revision 1.90
diff -c -r1.90 analyze.c
*** src/backend/commands/analyze.c	22 Nov 2005 18:17:08 -0000	1.90
--- src/backend/commands/analyze.c	13 Jan 2006 18:54:26 -0000
***************
*** 62,67 ****
--- 62,68 ----
  
  /* Default statistics target (GUC parameter) */
  int			default_statistics_target = 10;
+ int         analyze_blocks = 0;
  
  static int	elevel = -1;
  
***************
*** 77,84 ****
  					HeapTuple *rows, int numrows,
  					MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static int acquire_sample_rows(Relation onerel, HeapTuple *rows,
! 					int targrows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
--- 78,86 ----
  					HeapTuple *rows, int numrows,
  					MemoryContext col_context);
  static VacAttrStats *examine_attribute(Relation onerel, int attnum);
! static HeapTuple *acquire_sample_rows(Relation onerel, 
!                     int targblocks, int targrows, long allowedMem,
!                     int *samplerows, double *totalrows, double *totaldeadrows);
  static double random_fract(void);
  static double init_selection_state(int n);
  static double get_next_S(double t, int n, double *stateptr);
***************
*** 108,117 ****
--- 110,122 ----
  	VacAttrStats **vacattrstats;
  	AnlIndexData *indexdata;
  	int			targrows,
+                 targblocks,
  				numrows;
  	double		totalrows,
  				totaldeadrows;
  	HeapTuple  *rows;
+ 	long		allowedMem = 0;		/* total memory allowed, in bytes */
+ 
  
  	if (vacstmt->verbose)
  		elevel = INFO;
***************
*** 348,357 ****
  	/*
  	 * Acquire the sample rows
  	 */
- 	rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
- 	numrows = acquire_sample_rows(onerel, rows, targrows,
- 								  &totalrows, &totaldeadrows);
  
  	/*
  	 * Compute the statistics.	Temporary results during the calculations for
  	 * each column are stored in a child context.  The calc routines are
--- 353,382 ----
  	/*
  	 * Acquire the sample rows
  	 */
  
+     /*
+      * XXX Temporary way of specifying number of blocks to ANALYZE
+      */
+     if (analyze_blocks > 0)
+     {
+         /*
+          * Block sampling
+          */
+         targrows = 0;       /* row-level reservoir not used */
+         targblocks = analyze_blocks;
+     }
+     else
+         /*
+          * Bernoulli row sampling
+          */
+         targblocks = targrows;
+ 
+     allowedMem = maintenance_work_mem * 1024L;
+ 
+ 	rows = acquire_sample_rows(onerel,
+ 								  targblocks, targrows, allowedMem,
+ 								  &numrows, &totalrows, &totaldeadrows);
+ 
  	/*
  	 * Compute the statistics.	Temporary results during the calculations for
  	 * each column are stored in a child context.  The calc routines are
***************
*** 747,771 ****
  }
  
  /*
!  * acquire_sample_rows -- acquire a random sample of rows from the table
   *
!  * As of May 2004 we use a new two-stage method:  Stage one selects up
!  * to targrows random blocks (or all blocks, if there aren't so many).
!  * Stage two scans these blocks and uses the Vitter algorithm to create
   * a random sample of targrows rows (or less, if there are less in the
!  * sample of blocks).  The two stages are executed simultaneously: each
!  * block is processed as soon as stage one returns its number and while
!  * the rows are read stage two controls which ones are to be inserted
!  * into the sample.
   *
   * Although every row has an equal chance of ending up in the final
   * sample, this sampling method is not perfect: not every possible
   * sample has an equal chance of being selected.  For large relations
   * the number of different blocks represented by the sample tends to be
!  * too small.  We can live with that for now.  Improvements are welcome.
!  *
!  * We also estimate the total numbers of live and dead rows in the table,
!  * and return them into *totalrows and *totaldeadrows, respectively.
   *
   * An important property of this sampling method is that because we do
   * look at a statistically unbiased set of blocks, we should get
--- 772,799 ----
  }
  
  /*
!  * acquire_sample_rows -- acquire sample rows from table (heap)
!  *
!  * We use one of two methods: Row or Block Sampling
!  *
!  * The implementation of both follows a similar two-stage process:
!  * Stage one selects up to targblocks random blocks (or all blocks,
!  * if there aren't so many); this stage is common to both methods.
!  * Stage two scans these blocks and retrieves rows from them.
!  * The two stages are executed simultaneously: each block is processed
!  * as soon as stage one returns its number, and while the rows are
!  * read, stage two controls which ones are inserted into the sample.
   *
!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create
   * a random sample of targrows rows (or less, if there are less in the
!  * sample of blocks). In this case, targblocks is always the same as
!  * targrows, so on average we keep about one row per block.
   *
   * Although every row has an equal chance of ending up in the final
   * sample, this sampling method is not perfect: not every possible
   * sample has an equal chance of being selected.  For large relations
   * the number of different blocks represented by the sample tends to be
!  * too small.  In that case, block sampling should be used.
   *
   * An important property of this sampling method is that because we do
   * look at a statistically unbiased set of blocks, we should get
***************
*** 773,784 ****
   * block.  The previous sampling method put too much credence in the row
   * density near the start of the table.
   *
   * The returned list of tuples is in order by physical position in the table.
!  * (We will rely on this later to derive correlation estimates.)
   */
! static int
! acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
! 					double *totalrows, double *totaldeadrows)
  {
  	int			numrows = 0;	/* # rows collected */
  	double		liverows = 0;	/* # rows seen */
--- 801,831 ----
   * block.  The previous sampling method put too much credence in the row
   * density near the start of the table.
   *
+  * Block Sampling: Block sampling will scan all rows in each of the
+  * blocks sampled. This doesn't provide the same unbiased set of rows
+  * that Row Sampling achieves, but it has a number of advantages. First,
+  * the selection bias can help to detect clustered data that more widely
+  * spread row samples might miss. Second, a much larger sample can be
+  * acquired with greatly reduced I/O, making large samples practical
+  * with reasonable performance.
+  *
+  * Both sampling methods are constrained by allowedMem, as defined by the
+  * caller; ANALYZE uses maintenance_work_mem.
+  * This restriction keeps the sample small enough to be sorted in memory
+  * to derive per-column statistics. The randomness of the sample is not
+  * altered by this restriction, since it is applied before sampling begins.
+  *
+  * We also estimate the total numbers of live and dead rows in the table,
+  * and return them into *totalrows and *totaldeadrows, respectively.
+  *
   * The returned list of tuples is in order by physical position in the table.
!  * (We will rely on this later to derive correlation estimates.) This is
!  * achieved by sorting the result set when using Row Sampling; no sort
!  * is required for Block Sampling, since blocks are visited in order.
   */
! static HeapTuple *acquire_sample_rows(Relation onerel, 
!                     int targblocks, int targrows, long allowedMem,
!                     int *samplerows, double *totalrows, double *totaldeadrows)
  {
  	int			numrows = 0;	/* # rows collected */
  	double		liverows = 0;	/* # rows seen */
***************
*** 787,804 ****
  	BlockNumber totalblocks;
  	BlockSamplerData bs;
  	double		rstate;
  
! 	Assert(targrows > 1);
  
  	totalblocks = RelationGetNumberOfBlocks(onerel);
  
  	/* Prepare for sampling block numbers */
! 	BlockSampler_Init(&bs, totalblocks, targrows);
  	/* Prepare for sampling rows */
  	rstate = init_selection_state(targrows);
  
  	/* Outer loop over blocks to sample */
! 	while (BlockSampler_HasMore(&bs))
  	{
  		BlockNumber targblock = BlockSampler_Next(&bs);
  		Buffer		targbuffer;
--- 834,891 ----
  	BlockNumber totalblocks;
  	BlockSamplerData bs;
  	double		rstate;
+     bool        doBlockSampling = false;
+     bool        moreRows = true;
+     Size        memreqperblock;
+     HeapTuple   *rows;
+     int         allocrows;
  
! 	Assert(targblocks > 0);
! 	Assert(targrows >= 0);	/* targrows is 0 for block sampling */
  
  	totalblocks = RelationGetNumberOfBlocks(onerel);
  
+     /*
+      * Reduce any sample request that would not fit in allowedMem,
+      * then allocate the array for sample rows
+      */
+     if (targblocks != targrows)
+     {
+         doBlockSampling = true;
+ 
+         /* Assume initially that the on-disk representation of the data will
+          * be the same as the in-memory representation and that
+          * there will be at most 128 rows per block...
+          */
+         memreqperblock = BLCKSZ + (128 * sizeof(HeapTuple));
+ 
+         if ((long)(targblocks * memreqperblock) > allowedMem)
+         {
+             targblocks = (int) ( allowedMem / (long) memreqperblock );
+             allocrows = targblocks * 128;
+         }
+         else
+             /* Be more generous if we are not at memory limit */
+             allocrows = targblocks * MaxHeapTuplesPerPage;
+     }
+     else
+     {
+         allocrows = targrows;
+         /* Assume rows are at least 64 bytes wide, so the pointer array
+          * needs at most 1/16 of allowedMem; we're currently very
+          * unlikely to exceed that limit here */
+         if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))
+             allocrows = (allowedMem >> 4) / sizeof(HeapTuple);
+     }
+ 	rows = (HeapTuple *) palloc(allocrows * sizeof(HeapTuple));
+ 
  	/* Prepare for sampling block numbers */
! 	BlockSampler_Init(&bs, totalblocks, targblocks);
  	/* Prepare for sampling rows */
  	rstate = init_selection_state(targrows);
  
  	/* Outer loop over blocks to sample */
! 	while (BlockSampler_HasMore(&bs) && moreRows)
  	{
  		BlockNumber targblock = BlockSampler_Next(&bs);
  		Buffer		targbuffer;
***************
*** 831,840 ****
  			if (heap_release_fetch(onerel, SnapshotNow,
  								   &targtuple, &targbuffer,
  								   true, NULL))
! 			{
  				/*
! 				 * The first targrows live rows are simply copied into the
! 				 * reservoir. Then we start replacing tuples in the sample
  				 * until we reach the end of the relation.	This algorithm is
  				 * from Jeff Vitter's paper (see full citation below). It
  				 * works by repeatedly computing the number of tuples to skip
--- 918,943 ----
  			if (heap_release_fetch(onerel, SnapshotNow,
  								   &targtuple, &targbuffer,
  								   true, NULL))
! 			{
  				/*
! 				 * For Block Sampling, we stop when we run out of memory.
!                  * This means we may not fully sample all of the blocks
!                  * requested, which could further bias the sample.
!                  *
!                  * XXX measure memory usage as we proceed, possibly
!                  * altering the number of rows per block so that we scan
!                  * all blocks, limiting rows-per-block to fit in allowedMem.
!                  */
!                 if (doBlockSampling)
!                 {
!                     if (numrows < allocrows)
!                         rows[numrows++] = heap_copytuple(&targtuple);
!                     else
!                         moreRows = false;
!                 }
!                 /* For Row Sampling, the first targrows live rows are simply
!                  * copied into the reservoir. Once that limit is reached,
!                  * we start replacing tuples in the sample
  				 * until we reach the end of the relation.	This algorithm is
  				 * from Jeff Vitter's paper (see full citation below). It
  				 * works by repeatedly computing the number of tuples to skip
***************
*** 844,851 ****
  				 * we've passed over so far, so when we fall off the end of
  				 * the relation we're done.
  				 */
! 				if (numrows < targrows)
! 					rows[numrows++] = heap_copytuple(&targtuple);
  				else
  				{
  					/*
--- 947,954 ----
  				 * we've passed over so far, so when we fall off the end of
  				 * the relation we're done.
  				 */
! 				else if (numrows < targrows)
! 					rows[numrows++] = heap_copytuple(&targtuple);
  				else
  				{
  					/*
***************
*** 888,907 ****
  	}
  
  	/*
! 	 * If we didn't find as many tuples as we wanted then we're done. No sort
  	 * is needed, since they're already in order.
  	 *
  	 * Otherwise we need to sort the collected tuples by position
  	 * (itempointer). It's not worth worrying about corner cases where the
  	 * tuples are already sorted.
  	 */
! 	if (numrows == targrows)
  		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
  
  	/*
  	 * Estimate total numbers of rows in relation.
  	 */
! 	if (bs.m > 0)
  	{
  		*totalrows = floor((liverows * totalblocks) / bs.m + 0.5);
  		*totaldeadrows = floor((deadrows * totalblocks) / bs.m + 0.5);
--- 991,1011 ----
  	}
  
  	/*
! 	 * If we are doing Block Sampling, or if we are doing Row Sampling and
!      * didn't find as many tuples as we wanted, then we're done. No sort
  	 * is needed, since they're already in order.
  	 *
  	 * Otherwise we need to sort the collected tuples by position
  	 * (itempointer). It's not worth worrying about corner cases where the
  	 * tuples are already sorted.
  	 */
! 	if (numrows == targrows && !doBlockSampling)
  		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
  
  	/*
  	 * Estimate total numbers of rows in relation.
  	 */
! 	if (bs.m > 0)
  	{
  		*totalrows = floor((liverows * totalblocks) / bs.m + 0.5);
  		*totaldeadrows = floor((deadrows * totalblocks) / bs.m + 0.5);
***************
*** 911,916 ****
--- 1015,1021 ----
  		*totalrows = 0.0;
  		*totaldeadrows = 0.0;
  	}
+     *samplerows = numrows;
  
  	/*
  	 * Emit some interesting relation info
***************
*** 924,930 ****
  					liverows, deadrows,
  					numrows, *totalrows)));
  
! 	return numrows;
  }
  
  /* Select a random value R uniformly distributed in 0 < R < 1 */
--- 1029,1035 ----
  					liverows, deadrows,
  					numrows, *totalrows)));
  
! 	return rows;
  }
  
  /* Select a random value R uniformly distributed in 0 < R < 1 */
***************
*** 1045,1051 ****
  }
  
  /*
!  * qsort comparator for sorting rows[] array
   */
  static int
  compare_rows(const void *a, const void *b)
--- 1150,1157 ----
  }
  
  /*
!  * qsort comparator for sorting rows[] array as last part of Row Sampling
!  * in acquire_sample_rows()
   */
  static int
  compare_rows(const void *a, const void *b)
***************
*** 1557,1564 ****
  	/* We can only compute real stats if we found some non-null values. */
  	if (nonnull_cnt > 0)
  	{
! 		int			nmultiple,
! 					summultiple;
  
  		stats->stats_valid = true;
  		/* Do the simple null-frac and width stats */
--- 1663,1672 ----
  	/* We can only compute real stats if we found some non-null values. */
  	if (nonnull_cnt > 0)
  	{
! 		int			nmultiple,		/* # values that occur more than once */
! 					f1,				/* # values that occur exactly once */
! 					d;				/* # distinct values in sample */
! 		int			summultiple;
  
  		stats->stats_valid = true;
  		/* Do the simple null-frac and width stats */
***************
*** 1577,1586 ****
--- 1685,1710 ----
  			summultiple += track[nmultiple].count;
  		}
  
+ 		f1 = nonnull_cnt - summultiple;
+ 		d = f1 + nmultiple;
+ 
+         /* 
+          * Display the common statistical measures for the sample.
+          * These can be used to calculate various estimators of D (n_distinct)
+          * for comparison with Haas-Stokes Duj1 (see below)
+          */
+ 		ereport(DEBUG2,
+             (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
+                         stats->tupattnum, samplerows, nmultiple, f1, d)));
+ 
  		if (nmultiple == 0)
  		{
  			/* If we found no repeated values, assume it's a unique column */
  			stats->stadistinct = -1.0;
+     
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u result: assume unique; scale by N", 
+                         stats->tupattnum)));
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
***************
*** 1591,1605 ****
  			 * these values.
  			 */
  			stats->stadistinct = track_cnt;
  		}
  		else
  		{
  			/*----------
! 			 * Estimate the number of distinct values using the estimator
  			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
  			 *		n*d / (n - f1 + f1*n/N)
  			 * where f1 is the number of distinct values that occurred
! 			 * exactly once in our sample of n rows (from a total of N),
  			 * and d is the total number of distinct values in the sample.
  			 * This is their Duj1 estimator; the other estimators they
  			 * recommend are considerably more complex, and are numerically
--- 1715,1733 ----
  			 * these values.
  			 */
  			stats->stadistinct = track_cnt;
+ 
+             ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u result: assume D=%u; does not scale", 
+                         stats->tupattnum, track_cnt)));
  		}
  		else
  		{
  			/*----------
! 			 * Estimate the number of distinct values, D, using the estimator
  			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
  			 *		n*d / (n - f1 + f1*n/N)
  			 * where f1 is the number of distinct values that occurred
! 			 * exactly once in our sample of n rows (from a total of N rows),
  			 * and d is the total number of distinct values in the sample.
  			 * This is their Duj1 estimator; the other estimators they
  			 * recommend are considerably more complex, and are numerically
***************
*** 1608,1619 ****
  			 * We assume (not very reliably!) that all the multiply-occurring
  			 * values are reflected in the final track[] list, and the other
  			 * nonnull values all appeared but once.  (XXX this usually
! 			 * results in a drastic overestimate of ndistinct.	Can we do
  			 * any better?)
  			 *----------
  			 */
- 			int			f1 = nonnull_cnt - summultiple;
- 			int			d = f1 + nmultiple;
  			double		numer,
  						denom,
  						stadistinct;
--- 1736,1745 ----
  			 * We assume (not very reliably!) that all the multiply-occurring
  			 * values are reflected in the final track[] list, and the other
  			 * nonnull values all appeared but once.  (XXX this usually
! 			 * results in a drastic overestimate of D.	Can we do
  			 * any better?)
  			 *----------
  			 */
  			double		numer,
  						denom,
  						stadistinct;
***************
*** 1630,1635 ****
--- 1756,1765 ----
  			if (stadistinct > totalrows)
  				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
+ 
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u estimator: Haas-Stokes Duj1 D=%f",
+                         stats->tupattnum, stats->stadistinct)));
  		}
  
  		/*
***************
*** 1639,1646 ****
--- 1769,1782 ----
  		 * value.
  		 */
  		if (stats->stadistinct > 0.1 * totalrows)
+         {
  			stats->stadistinct = -(stats->stadistinct / totalrows);
  
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u heuristic: D=%f; scale by N", 
+                         stats->tupattnum, stats->stadistinct)));
+         }
+ 
  		/*
  		 * Decide how many values are worth storing as most-common values. If
  		 * we are able to generate a complete MCV list (all the values in the
***************
*** 1832,1841 ****
  	/* We can only compute real stats if we found some sortable values. */
  	if (values_cnt > 0)
  	{
! 		int			ndistinct,	/* # distinct values in sample */
! 					nmultiple,	/* # that appear multiple times */
! 					num_hist,
  					dups_cnt;
  		int			slot_idx = 0;
  
  		/* Sort the collected values */
--- 1968,1978 ----
  	/* We can only compute real stats if we found some sortable values. */
  	if (values_cnt > 0)
  	{
!         int			num_hist,
  					dups_cnt;
+ 		int			nmultiple,		/* # values that occur more than once */
+ 					f1,				/* # values assumed to occur exactly once */
+ 					ndistinct;		/* # distinct values in sample */
  		int			slot_idx = 0;
  
  		/* Sort the collected values */
***************
*** 1917,1926 ****
--- 2054,2081 ----
  		else
  			stats->stawidth = stats->attrtype->typlen;
  
+         /*
+          * We assume that very wide values are distinct from each other
+          */
+ 		f1 = ndistinct - nmultiple + toowide_cnt;
+ 
+         /* 
+          * Display the common statistical measures for the sample.
+          * These can be used to calculate various estimators of D (n_distinct)
+          * for comparison with Haas-Stokes Duj1 (see below)
+          */
+ 		ereport(DEBUG2,
+             (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
+                         stats->tupattnum, samplerows, nmultiple, f1, f1 + nmultiple)));
+     
  		if (nmultiple == 0)
  		{
  			/* If we found no repeated values, assume it's a unique column */
  			stats->stadistinct = -1.0;
+ 
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u result: assume unique; scale by N", 
+                         stats->tupattnum)));
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
***************
*** 1929,1943 ****
  			 * column has just these values.
  			 */
  			stats->stadistinct = ndistinct;
  		}
  		else
  		{
  			/*----------
! 			 * Estimate the number of distinct values using the estimator
  			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
  			 *		n*d / (n - f1 + f1*n/N)
  			 * where f1 is the number of distinct values that occurred
! 			 * exactly once in our sample of n rows (from a total of N),
  			 * and d is the total number of distinct values in the sample.
  			 * This is their Duj1 estimator; the other estimators they
  			 * recommend are considerably more complex, and are numerically
--- 2084,2101 ----
  			 * column has just these values.
  			 */
  			stats->stadistinct = ndistinct;
+             ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u result: assume D=%u; does not scale",
+                         stats->tupattnum, ndistinct)));
  		}
  		else
  		{
  			/*----------
! 			 * Estimate the number of distinct values, D, using the estimator
  			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
  			 *		n*d / (n - f1 + f1*n/N)
  			 * where f1 is the number of distinct values that occurred
! 			 * exactly once in our sample of n rows (from a total of N rows),
  			 * and d is the total number of distinct values in the sample.
  			 * This is their Duj1 estimator; the other estimators they
  			 * recommend are considerably more complex, and are numerically
***************
*** 1946,1953 ****
  			 * Overwidth values are assumed to have been distinct.
  			 *----------
  			 */
! 			int			f1 = ndistinct - nmultiple + toowide_cnt;
! 			int			d = f1 + nmultiple;
  			double		numer,
  						denom,
  						stadistinct;
--- 2104,2110 ----
  			 * Overwidth values are assumed to have been distinct.
  			 *----------
  			 */
! 			int			d = f1 + nmultiple;
  			double		numer,
  						denom,
  						stadistinct;
***************
*** 1964,1969 ****
--- 2121,2129 ----
  			if (stadistinct > totalrows)
  				stadistinct = totalrows;
  			stats->stadistinct = floor(stadistinct + 0.5);
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u estimator: Haas-Stokes Duj1 D=%f",
+                         stats->tupattnum, stats->stadistinct)));
  		}
  
  		/*
***************
*** 1973,1980 ****
--- 2133,2146 ----
  		 * value.
  		 */
  		if (stats->stadistinct > 0.1 * totalrows)
+         {
  			stats->stadistinct = -(stats->stadistinct / totalrows);
  
+     		ereport(DEBUG2,
+                 (errmsg("ANALYZE attr %u heuristic: D=%f; scale by N", 
+                         stats->tupattnum, stats->stadistinct)));
+         }
+ 
  		/*
  		 * Decide how many values are worth storing as most-common values. If
  		 * we are able to generate a complete MCV list (all the values in the
***************
*** 2204,2212 ****
  compare_scalars(const void *a, const void *b)
  {
  	Datum		da = ((ScalarItem *) a)->value;
- 	int			ta = ((ScalarItem *) a)->tupno;
  	Datum		db = ((ScalarItem *) b)->value;
! 	int			tb = ((ScalarItem *) b)->tupno;
  	int32		compare;
  
  	compare = ApplySortFunction(datumCmpFn, datumCmpFnKind,
--- 2370,2379 ----
  compare_scalars(const void *a, const void *b)
  {
  	Datum		da = ((ScalarItem *) a)->value;
  	Datum		db = ((ScalarItem *) b)->value;
! 
! 	int			ta,
! 				tb;
  	int32		compare;
  
  	compare = ApplySortFunction(datumCmpFn, datumCmpFnKind,
***************
*** 2217,2222 ****
--- 2384,2392 ----
  	/*
  	 * The two datums are equal, so update datumCmpTupnoLink[].
  	 */
+ 	ta = ((ScalarItem *) a)->tupno;
+ 	tb = ((ScalarItem *) b)->tupno;
+ 
  	if (datumCmpTupnoLink[ta] < tb)
  		datumCmpTupnoLink[ta] = tb;
  	if (datumCmpTupnoLink[tb] < ta)
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.309
diff -c -r1.309 guc.c
*** src/backend/utils/misc/guc.c	9 Jan 2006 10:05:31 -0000	1.309
--- src/backend/utils/misc/guc.c	13 Jan 2006 18:54:32 -0000
***************
*** 1022,1027 ****
--- 1022,1035 ----
  		0, 0, INT_MAX, NULL, NULL
  	},
  	{
+ 		{"analyze_blocks", PGC_USERSET, QUERY_TUNING_OTHER,
+ 			gettext_noop("Sets the number of blocks to analyze."),
+ 			NULL
+ 		},
+ 		&analyze_blocks,
+ 		0, 0, INT_MAX, NULL, NULL
+ 	},
+ 	{
  		{"default_statistics_target", PGC_USERSET, QUERY_TUNING_OTHER,
  			gettext_noop("Sets the default statistics target."),
  			gettext_noop("This applies to table columns that have not had a "
Index: src/include/commands/vacuum.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/commands/vacuum.h,v
retrieving revision 1.62
diff -c -r1.62 vacuum.h
*** src/include/commands/vacuum.h	15 Oct 2005 02:49:44 -0000	1.62
--- src/include/commands/vacuum.h	13 Jan 2006 18:54:33 -0000
***************
*** 107,112 ****
--- 107,113 ----
  
  /* Default statistics target (GUC parameter) */
  extern DLLIMPORT int default_statistics_target; /* DLLIMPORT for PostGIS */
+ extern DLLIMPORT int analyze_blocks;            /* DLLIMPORT for PostGIS */
  
  
  /* in commands/vacuum.c */