Re: [HACKERS] Statistics and selectivity estimation for ranges

Heikki Linnakangas Tue, 12 Mar 2013 09:04:43 -0700

On 01.03.2013 16:22, Alexander Korotkov wrote:

I've been staring at this code for a very long time now, trying to
understand how the math in calc_hist_selectivity_contained works. I
think I understand it now, but it probably needs a lot more comments and
perhaps some refactoring, so that the next reader won't need to spend
hours deciphering it.


I'll walk through an example of a calc_hist_selectivity_contained
invocation, to verify that my understanding is correct. This isn't 100%
identical to how the function works; I explain it as if it holds some
temporary bins in memory and modifies them in steps, but in reality, it
keeps the bin distances and some other information only for the
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 50)", and the lower
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
looks like this:


Boundary 10 20 40 100 120
-+------+------+------+------+-
Fraction 0.25 0.25 0.25 0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
proportion, 25%, of all the tuples in the table. The function first
finds the bins containing the lower and upper bounds, 15 and 55. All the
bins outside those bounds are ignored, as there are no matching tuples
there. The fractions and the bounds of first and last bin, ie. those
containing the lower and upper bounds, are adjusted according to the
boundary values, using linear interpolation. The lower bound, 15, falls
in the middle of the bin 10-20, and the upper bound, 55, splits the
40-100 bin at ratio of 1/5. The adjusted bins look like this:

Boundary 15 20 40 55
-+------+------+------+
Fraction 0.125 0.25 0.05

Next, we need to calculate what proportion of tuples in each bin has a
small enough length to be contained withing the query range. For that,
the distance of each bin boundary to the upper bound is calculated:

Distance 40 35 15 0
-+------+------+------+
Fraction 0.125 0.25 0.05

The bins are walked starting from the highest bin, ie. starting from
distance 0, walking up in increasing order of distance. For each bin,
the proportion of tuples within that range that have a suitable length
is calculated. The calc_length_hist_frac function does that. That
calculation is more complicated than it sounds: for example, for the
middle bin above, calc_length_hist_frac is passed both distance
boundaries, 15 and 35. calc_length_hist frac calculates the average of
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
of tuples with length <= x.

Now, here's a question, on something I didn't quite understand:

* Returns average fraction of histogram which is greater than given
length.
* While this length is increasing from length1 to *length2. If histogram
* ends up before *length2 then set covered fraction of (length1,
*length2)
* interval to *fraction and set end of histogram to *length2.
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double length1, double *length2, double *fraction)
{


Why the behavior explained in the last sentence in the above comment? It
seems that the abstraction provided by calc_length_hist_frac() is leaky;
the caller shouldn't need to know anything about the boundaries of the
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be
explained like this:

/*
* Let P(x) be the fraction of tuples with length <= x.
*
* calc_length_hist_frac calculates the average of P(x), in the
interval [A, B].
*
* This can be calculated by the formula:
*
* B
* 1 /
* ------- | P(x)dx
* B - A /
* A
*/
static double
calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
double A, double B)


Am I correct this far? The function doesn't use the above formula as is,
but it could..

I'll continue trying to understand this and add comments..

So, after some hacking, I ended up with this version. I find it morereadable, I hope I didn't miss anything. This seems to produce resultsthat are close, but not identical, to the original patch. I'm not surewhere the discrepancy is coming from, or which patch is more correct inthat respect. I'll continue from this tomorrow, but if you have thetime, please take a look and let me know what you think.


- Heikki

*** a/src/backend/utils/adt/rangetypes_selfuncs.c
--- b/src/backend/utils/adt/rangetypes_selfuncs.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/htup_details.h"
  #include "catalog/pg_operator.h"
  #include "catalog/pg_statistic.h"
+ #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/rangetypes.h"
  #include "utils/selfuncs.h"
***************
*** 39,44 **** static int rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value,
--- 40,58 ----
  			   RangeBound *hist, int hist_length, bool equal);
  static float8 get_position(TypeCacheEntry *typcache, RangeBound *value,
  			 RangeBound *hist1, RangeBound *hist2);
+ static float8 get_len_position(double value, double hist1, double hist2);
+ static float8 get_distance(TypeCacheEntry *typcache, RangeBound *bound1,
+ 															RangeBound *bound2);
+ static int length_hist_bsearch(Datum *length_hist_values,
+ 										int length_hist_nvalues, double value);
+ static double calc_length_hist_frac(Datum *length_hist_values,
+ 	int length_hist_nvalues, double length1, double length2);
+ static double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 				RangeBound *lower, RangeBound *upper, RangeBound *hist_lower,
+ 		int hist_nvalues, Datum *length_hist_values, int length_hist_nvalues);
+ static double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 				RangeBound *lower, RangeBound *upper, RangeBound *hist_lower,
+ 		int hist_nvalues, Datum *length_hist_values, int length_hist_nvalues);
  
  /*
   * Returns a default selectivity estimate for given operator, when we don't
***************
*** 213,219 **** calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata,
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_RANGE_EMPTY_FRAC, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
--- 227,233 ----
  		/* Try to get fraction of empty ranges */
  		if (get_attstatsslot(vardata->statsTuple,
  							 vardata->atttype, vardata->atttypmod,
! 							 STATISTIC_KIND_LENGTH_HISTOGRAM, InvalidOid,
  							 NULL,
  							 NULL, NULL,
  							 &numbers, &nnumbers))
***************
*** 332,337 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 346,353 ----
  {
  	Datum	   *hist_values;
  	int			nhist;
+ 	Datum	   *length_hist_values;
+ 	int			length_nhist;
  	RangeBound *hist_lower;
  	RangeBound *hist_upper;
  	int			i;
***************
*** 365,370 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 381,400 ----
  			elog(ERROR, "bounds histogram contains an empty range");
  	}
  
+ 	/* @> and @< also need a histogram of range lengths */
+ 	if (operator == OID_RANGE_CONTAINS_OP ||
+ 		operator == OID_RANGE_CONTAINED_OP)
+ 	{
+ 		if (!(HeapTupleIsValid(vardata->statsTuple) &&
+ 			  get_attstatsslot(vardata->statsTuple,
+ 							   vardata->atttype, vardata->atttypmod,
+ 							   STATISTIC_KIND_LENGTH_HISTOGRAM, InvalidOid,
+ 							   NULL,
+ 							   &length_hist_values, &length_nhist,
+ 							   NULL, NULL)))
+ 			return -1.0;
+ 	}
+ 
  	/* Extract the bounds of the constant value. */
  	range_deserialize(typcache, constval, &const_lower, &const_upper, &empty);
  	Assert (!empty);
***************
*** 440,445 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
--- 470,478 ----
  			/*
  			 * A && B <=> NOT (A << B OR A >> B).
  			 *
+ 			 * Since A << B and A >> B are mutually exclusive events we can sum
+ 			 * their probabilities to find probability of (A << B OR A >> B).
+ 			 *
  			 * "range @> elem" is equivalent to "range && [elem,elem]". The
  			 * caller already constructed the singular range from the element
  			 * constant, so just treat it the same as &&.
***************
*** 454,462 **** calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
  		case OID_RANGE_CONTAINED_OP:
! 			/* TODO: not implemented yet */
! 			hist_selec = -1.0;
  			break;
  
  		default:
--- 487,522 ----
  			break;
  
  		case OID_RANGE_CONTAINS_OP:
+ 			hist_selec =
+ 				calc_hist_selectivity_contains(typcache, &const_lower,
+ 											&const_upper, hist_lower, nhist,
+ 											length_hist_values, length_nhist);
+ 			break;
+ 
  		case OID_RANGE_CONTAINED_OP:
! 			if (const_lower.infinite)
! 			{
! 				/*
! 				 * Lower bound no longer matters. Just estimate the fraction
! 				 * with an upper bound <= const uppert bound
! 				 */
! 				hist_selec =
! 					calc_hist_selectivity_scalar(typcache, &const_upper,
! 												 hist_upper, nhist, true);
! 			}
! 			else if (const_upper.infinite)
! 			{
! 				hist_selec =
! 					1.0 - calc_hist_selectivity_scalar(typcache, &const_lower,
! 													   hist_lower, nhist, false);
! 			}
! 			else
! 			{
! 				hist_selec =
! 					calc_hist_selectivity_contained(typcache, &const_lower,
! 													&const_upper, hist_lower, nhist,
! 													length_hist_values, length_nhist);
! 			}
  			break;
  
  		default:
***************
*** 497,504 **** calc_hist_selectivity_scalar(TypeCacheEntry *typcache, RangeBound *constbound,
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less than given range bound. If all range bounds in
!  * array are greater or equal than given range bound, return -1.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 557,569 ----
  
  /*
   * Binary search on an array of range bounds. Returns greatest index of range
!  * bound in array which is less(less or equal) than given range bound. If all
!  * range bounds in array are greater or equal(greater) than given range bound,
!  * return -1. When "equal" flag is set conditions in brackets are used.
!  *
!  * This function is used in scalar operators selectivity estimation. Another
!  * goal of this function is to found an histogram bin where to stop
!  * interpolation of portion of bounds which are less or equal to given bound.
   */
  static int
  rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
***************
*** 522,527 **** rbound_bsearch(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist,
--- 587,617 ----
  	return lower;
  }
  
+ 
+ /*
+  * Binary search on length histogram. Returns greatest index of range length in
+  * histogram which is less than the given length value.
+  */
+ static int
+ length_hist_bsearch(Datum *length_hist_values, int length_hist_nvalues,
+ 					double value)
+ {
+ 	int			lower = -1,
+ 				upper = length_hist_nvalues - 1,
+ 				middle;
+ 
+ 	while (lower < upper)
+ 	{
+ 		middle = (lower + upper + 1) / 2;
+ 
+ 		if (DatumGetFloat8(length_hist_values[middle]) < value)
+ 			lower = middle;
+ 		else
+ 			upper = middle - 1;
+ 	}
+ 	return lower;
+ }
+ 
  /*
   * Get relative position of value in histogram bin in [0,1] range.
   */
***************
*** 598,600 **** get_position(TypeCacheEntry *typcache, RangeBound *value, RangeBound *hist1,
--- 688,1082 ----
  	}
  }
  
+ 
+ /*
+  * Get relative position of value in histogram bin in [0,1] range.
+  */
+ static double
+ get_len_position(double value, double hist1, double hist2)
+ {
+ 	if (!is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Both bounds are finite. Assuming the subtype's comparison function
+ 		 * works sanely, the value must be finite, too, because it lies
+ 		 * somewhere between the bounds. If it doesn't, just return something.
+ 		 */
+ 		if (is_infinite(value))
+ 			return 0.5;
+ 
+ 		/* Calculate relative position using subdiff function. */
+ 		{
+ 			double d = 1.0 - (hist2 - value) / (hist2 - hist1);
+ 			Assert(!is_infinite(d));
+ 			return d;
+ 		}
+ 	}
+ 	else if (is_infinite(hist1) && !is_infinite(hist2))
+ 	{
+ 		/*
+ 		 * Lower bin boundary is -infinite, upper is finite.
+ 		 * Return 1.0 to indicate the value is infinitely far from the lower
+ 		 * bound.
+ 		 */
+ 		return 1.0;
+ 	}
+ 	else if (is_infinite(hist1) && is_infinite(hist2))
+ 	{
+ 		/* same as above, but in reverse */
+ 		return 0.0;
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * If both bin boundaries are infinite, they should be equal to each
+ 		 * other, and the value should also be infinite and equal to both
+ 		 * bounds. (But don't Assert that, to avoid crashing if a user creates
+ 		 * a datatype with a broken comparison function).
+ 		 *
+ 		 * Assume the value to lie in the middle of the infinite bounds.
+ 		 */
+ 		return 0.5;
+ 	}
+ }
+ 
+ /*
+  * Measure distance between two range bounds.
+  */
+ static float8
+ get_distance(TypeCacheEntry *typcache, RangeBound *bound1, RangeBound *bound2)
+ {
+ 	bool	has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
+ 
+ 	if (!bound1->infinite && !bound2->infinite)
+ 	{
+ 		/*
+ 		 * No bounds are infinite, use subdiff function or return default
+ 		 * value of 1.0 if no subdiff is available.
+ 		 */
+ 		if (has_subdiff)
+ 			return
+ 				DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
+ 												 typcache->rng_collation,
+ 												 bound2->val,
+ 												 bound1->val));
+ 		else
+ 			return 1.0;
+ 	}
+ 	else if (bound1->infinite && bound2->infinite)
+ 	{
+ 		/* Both bounds are infinite */
+ 		if (bound1->lower == bound2->lower)
+ 			return 0.0;
+ 		else
+ 			return get_float8_infinity();
+ 	}
+ 	else
+ 	{
+ 		/* One bound is infinite, another is not */
+ 		return get_float8_infinity();
+ 	}
+ }
+ 
+ /*
+  * Calculate the average of function P(x), in the interval [length1, length2],
+  * where P(x) is the fraction of tuples with length <= x.
+  */
+ static double
+ calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
+ 					  double length1, double length2)
+ {
+ 	double		frac;
+ 	double		A, B, PA, PB;
+ 	double		pos;
+ 	int			i;
+ 	double		area;
+ 
+ 	Assert(length2 >= length1);
+ 
+ 	/*----------
+ 	 * The average of a function between A and B can be calculated by the
+ 	 * formula:
+ 	 *
+ 	 *          B
+ 	 *    1     /
+ 	 * -------  | P(x)dx
+ 	 *  B - A   /
+ 	 *          A
+ 	 *
+ 	 * The geometrical interpretation is that the integral is the area under
+ 	 * the graph of P(x). P(x) is defined by the length histogram. We
+ 	 * calculate the area in a piecewise fashion, iterating through the length
+ 	 * histogram bins. Each bin is a trapezoid:
+ 	 *
+ 	 *       P(x2)
+ 	 *        /|
+ 	 *       / |
+ 	 * P(x1)/  |
+ 	 *     |   |
+ 	 *     |   |
+ 	 *  ---+---+--
+ 	 *     x1  x2
+ 	 *
+ 	 * where x1 and x2 are the boundaries of the current histogram, and P(x1)
+ 	 * and P(x1) are the cumulative fraction of tuples at the boundaries.
+ 	 *
+ 	 * The area of each trapezoid is 1/2 * (P(x2) + P(x1)) * (x2 - x1)
+ 	 *
+ 	 * The first bin contains the lower bound passed by the caller, so we
+ 	 * use linear interpolation between the previous and next histogram bin
+ 	 * boundary to calculate P(x1). Likewise for the last bin: we use linear
+ 	 * interpolation to calculate P(x2). For the bins in between, x1 and x2
+ 	 * lie on histogram bin boundaries, so P(x1) and P(x2) are simply:
+ 	 * P(x1) =    (bin index) / (number of bins)
+ 	 * P(x2) = (bin index + 1 / (number of bins)
+ 	 */
+ 
+ 	/* First bin, the one that contains lower bound */
+ 	i = length_hist_bsearch(length_hist_values, length_hist_nvalues, length1);
+ 	if (i >= length_hist_nvalues - 1)
+ 		return 1.0;
+ 
+ 	if (i < 0)
+ 	{
+ 		i = 0;
+ 		pos = 0.0;
+ 	}
+ 	else
+ 	{
+ 		/* interpolate length1's position in the bin */
+ 		pos = get_len_position(length1,
+ 							   DatumGetFloat8(length_hist_values[i]),
+ 							   DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 	B = length1;
+ 
+ 	/*
+ 	 * In the degenerate case that length1 == length2, simply return P(length1).
+ 	 * This is not merely an optimization: if length1 == length2, we'd divide
+ 	 * by zero later on.
+ 	 */
+ 	if (length2 == length1)
+ 		return PB;
+ 
+ 	/*
+ 	 * Loop through all the bins, until we hit the last bin, the one that
+ 	 * contains the upper bound. (if lower and upper bounds are in the same
+ 	 * bin, this falls out immediately)
+ 	 */
+ 	area = 0.0;
+ 	for (; i < length_hist_nvalues - 1 && DatumGetFloat8(length_hist_values[i + 1]) < length2; i++)
+ 	{
+ 		/* the upper bound of previous bin is the lower bound of this bin */
+ 		A = B; PA = PB;
+ 
+ 		B = DatumGetFloat8(length_hist_values[i + 1]);
+ 		PB = (double) i / (double) (length_hist_nvalues - 1);
+ 
+ 		area += 0.5 * (PB + PA) * (B - A);
+ 	}
+ 
+ 	/* Last bin */
+ 	A = B; PA = PB;
+ 
+ 	B = length2; /* last bin ends at the query upper bound */
+ 	if (i >= length_hist_nvalues - 1)
+ 		pos = 0.0;
+ 	else
+ 	{
+ 		if (DatumGetFloat8(length_hist_values[i]) == DatumGetFloat8(length_hist_values[i + 1]))
+ 			pos = 0.0;
+ 		else
+ 			pos = get_len_position(length2, DatumGetFloat8(length_hist_values[i]), DatumGetFloat8(length_hist_values[i + 1]));
+ 	}
+ 	PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
+ 
+ 	area += 0.5 * (PB + PA) * (B - A);
+ 
+ 	/*
+ 	 * Ok, we have calculated the area, ie. the integral. Divide by width to
+ 	 * get the requested average.
+ 	 */
+ 	frac = area / (length2 - length1);
+ 
+ 	return frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "<@" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var <@ const", ie. estimate the fraction of ranges that
+  * fall within the constant lower and upper bounds.
+  *
+  * The caller has already checked that lower and upper are finite
+  */
+ static double
+ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ 								RangeBound *lower, RangeBound *upper,
+ 								RangeBound *hist_lower,	int hist_nvalues,
+ 								Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				upper_index;
+ 	float8		prev_dist;
+ 	double		bin_width;
+ 	double		upper_frac;
+ 	double		sum_frac;
+ 
+ 	/*
+ 	 * Begin by finding the bin containing the upper bound, in the lower bound
+ 	 * histogram. Any range with a lower bound > constant upper bound can't
+ 	 * match, ie. there are no matches in bins greater than upper_index.
+ 	 */
+ 	upper->inclusive = !upper->inclusive;
+ 	upper->lower = true;
+ 	upper_index = rbound_bsearch(typcache, upper, hist_lower, hist_nvalues, false);
+ 
+ 	/*
+ 	 * Calculate upper_frac -- fraction of (upper_index, upper_index + 1) bin
+ 	 * which is greater than upper bound of query range using linear
+ 	 * interpolation of subdiff function. Adjust selectivity with it.
+ 	 */
+ 	if (upper_index >= 0 && upper_index < hist_nvalues - 1)
+ 		upper_frac = get_position(typcache, upper, &hist_lower[upper_index],
+ 								  &hist_lower[upper_index + 1]);
+ 	else
+ 		upper_frac = 0.0;
+ 
+ 	/*
+ 	 * bin_width represents the width of the current bin. Normally it is 1.0,
+ 	 * meaning a full width bin, but can be less in the corner cases: start
+ 	 * and end of the loop. We start with hist_frac = upper_frac, because we
+ 	 * begin at the bin containing the upper bound.
+ 	 *
+ 	 * In the loop, dist and prev_dist are the distance of the "current" bin's
+ 	 * lower and upper bounds from the from the constant upper bound.
+ 	 */
+ 	bin_width = upper_frac;
+ 	prev_dist = 0.0;
+ 
+ 	sum_frac = 0.0;
+ 	for (i = upper_index; i >= 0; i--)
+ 	{
+ 		double		dist;
+ 		double		length_hist_frac;
+ 		bool		final_bin = false;
+ 
+ 		/*
+ 		 * dist -- distance from upper bound of query range to current
+ 		 * value of lower bound histogram or lower bound of query range (if
+ 		 * we've reached it).
+ 		 */
+ 		if (range_cmp_bounds(typcache, &hist_lower[i], lower) < 0)
+ 		{
+ 			dist = get_distance(typcache, lower, upper);
+ 			/*
+ 			 * Adjust bin width, as we are only taking into account a part of
+ 			 * this bin.
+ 			 */
+ 			bin_width *= get_position(typcache, lower, &hist_lower[i],
+ 									  &hist_lower[i + 1]);
+ 			final_bin = true;
+ 		}
+ 		else
+ 			dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Get average fraction of length histogram which covers intervals
+ 		 * longer than distance to upper bound of query range.
+ 		 */
+ 		length_hist_frac = calc_length_hist_frac(length_hist_values,
+ 			length_hist_nvalues, prev_dist, dist);
+ 
+ 		/*
+ 		 * Adjust overall selectivity according to current fraction of lower
+ 		 * bound histogram assuming not satisfying fraction of length
+ 		 * histogram to be average between this and previous steps.
+ 		 */
+ 		sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
+ 
+ 		if (final_bin)
+ 			break;
+ 
+ 		bin_width = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
+ 
+ /*
+  * Calculate selectivity of "@>" operator using histograms of range lower bounds
+  * and histogram of range lengths.
+  *
+  * Note, this is "var @> const", ie. estimate the fraction of ranges that
+  * contain the constant lower and upper bounds.
+  */
+ static double
+ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ 							   RangeBound *lower, RangeBound *upper,
+ 							   RangeBound *hist_lower, int hist_nvalues,
+ 							   Datum *length_hist_values, int length_hist_nvalues)
+ {
+ 	int			i,
+ 				lower_index;
+ 	float8		hist_frac,
+ 				lower_frac;
+ 	double		sum_frac;
+ 	float8		prev_dist;
+ 
+ 	/* Find index corresponding to lower bound of query range. */
+ 	lower_index = rbound_bsearch(typcache, lower, hist_lower, hist_nvalues, false);
+ 
+ 	/*
+ 	 * Calculate lower_frac -- fraction of (lower_index, lower_index + 1) bin
+ 	 * which is greater than lower bound of query range using linear
+ 	 * interpolation of subdiff function. Adjust selectivity with it.
+ 	 */
+ 	if (lower_index >= 0 && lower_index < hist_nvalues - 1)
+ 		lower_frac = get_position(typcache, lower, &hist_lower[lower_index],
+ 								  &hist_lower[lower_index + 1]);
+ 	else
+ 		lower_frac = 0.0;
+ 
+ 	/*
+ 	 * We start from selec value of 0. And then iterate on lower bounds
+ 	 * histogram from lower value of query range it's beginning. At each
+ 	 * step we add fraction of values in histogram bin which length are so
+ 	 * that they should contain query range. We use linear interpolation
+ 	 * for achieving more accurate result.
+ 	 */
+ 	prev_dist = get_distance(typcache, lower, upper);
+ 	sum_frac = 0.0;
+ 	hist_frac = lower_frac;
+ 	for (i = lower_index; i >= 0; i--)
+ 	{
+ 		float8		dist;
+ 		double		length_hist_frac;
+ 
+ 		/*
+ 		 * dist -- distance from upper bound of query range to current
+ 		 * value of lower bound histogram or lower bound of query range (if
+ 		 * we've reach it).
+ 		 */
+ 		dist = get_distance(typcache, &hist_lower[i], upper);
+ 
+ 		/*
+ 		 * Get average fraction of length histogram which covers intervals
+ 		 * longer than distance to upper bound of query range.
+ 		 */
+ 		length_hist_frac =
+ 			1.0 - calc_length_hist_frac(length_hist_values,
+ 										length_hist_nvalues,
+ 										prev_dist, dist);
+ 
+ 		sum_frac += length_hist_frac * hist_frac / (double) (hist_nvalues - 1);
+ 
+ 		hist_frac = 1.0;
+ 		prev_dist = dist;
+ 	}
+ 
+ 	return sum_frac;
+ }
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
***************
*** 29,34 ****
--- 29,36 ----
  #include "utils/builtins.h"
  #include "utils/rangetypes.h"
  
+ static int float8_qsort_cmp(const void *a1, const void *a2);
+ static int range_bound_qsort_cmp(const void *a1, const void *a2, void *arg);
  static void compute_range_stats(VacAttrStats *stats,
  		   AnalyzeAttrFetchFunc fetchfunc, int samplerows, double totalrows);
  
***************
*** 57,62 **** range_typanalyze(PG_FUNCTION_ARGS)
--- 59,84 ----
  }
  
  /*
+  * Comparison function for float8 which are used for measurement of range
+  * size.
+  */
+ static int
+ float8_qsort_cmp(const void *a1, const void *a2)
+ {
+ 	const float8 *f1,
+ 			   *f2;
+ 
+ 	f1 = (const float8 *) a1;
+ 	f2 = (const float8 *) a2;
+ 	if (*f1 < *f2)
+ 		return -1;
+ 	else if (*f1 == *f2)
+ 		return 0;
+ 	else
+ 		return 1;
+ }
+ 
+ /*
   * Comparison function for sorting RangeBounds.
   */
  static int
***************
*** 77,82 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 99,105 ----
  					int samplerows, double totalrows)
  {
  	TypeCacheEntry *typcache = (TypeCacheEntry *) stats->extra_data;
+ 	bool		has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
  	int			null_cnt = 0;
  	int			non_null_cnt = 0;
  	int			non_empty_cnt = 0;
***************
*** 85,94 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 108,119 ----
  	int			slot_idx;
  	int			num_bins = stats->attr->attstattarget;
  	int			num_hist;
+ 	float8	   *lengths;
  	RangeBound *lowers, *uppers;
  	double		total_width = 0;
  
  	/* Allocate memory for arrays of range bounds. */
+ 	lengths = (float8 *) palloc(sizeof(float8) * samplerows);
  	lowers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  	uppers = (RangeBound *) palloc(sizeof(RangeBound) * samplerows);
  
***************
*** 101,106 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 126,132 ----
  		RangeType  *range;
  		RangeBound	lower,
  					upper;
+ 		float8		length;
  
  		vacuum_delay_point();
  
***************
*** 122,127 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 148,179 ----
  		range = DatumGetRangeType(value);
  		range_deserialize(typcache, range, &lower, &upper, &empty);
  
+ 		if (empty)
+ 		{
+ 			/* Length of empty range is zero */
+ 			length = 0.0;
+ 		}
+ 		else if (lower.infinite || upper.infinite)
+ 		{
+ 			/* Length of any kind of infinite range is initite */
+ 			length = get_float8_infinity();
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * For ordinal range use subdiff function between upper and lower
+ 			 * bound values or use default value of 1.0 if no subdiff is
+ 			 * available.
+ 			 */
+ 			if (has_subdiff)
+ 				length = DatumGetFloat8(FunctionCall2Coll(
+ 												&typcache->rng_subdiff_finfo,
+ 												typcache->rng_collation,
+ 												upper.val, lower.val));
+ 			else
+ 				length = 1.0;
+ 		}
+ 
  		if (!empty)
  		{
  			/* Fill bound values for further usage in histograms */
***************
*** 132,137 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 184,190 ----
  		else
  			empty_cnt++;
  
+ 		lengths[non_null_cnt] = length;
  		non_null_cnt++;
  	}
  
***************
*** 141,146 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
--- 194,200 ----
  	if (non_null_cnt > 0)
  	{
  		Datum	   *bound_hist_values;
+ 		Datum	   *length_hist_values;
  		int			pos,
  					posfrac,
  					delta,
***************
*** 210,219 **** compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			slot_idx++;
  		}
  
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 		stats->stakind[slot_idx] = STATISTIC_KIND_RANGE_EMPTY_FRAC;
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
  		slot_idx++;
--- 264,331 ----
  			slot_idx++;
  		}
  
+ 		stats->stakind[slot_idx] = STATISTIC_KIND_LENGTH_HISTOGRAM;
+ 		/*
+ 		 * Generate a length histogram slot entry if we've collected at least
+ 		 * two length values which can be not distinct.
+ 		 */
+ 		if (non_null_cnt >= 2)
+ 		{
+ 			/*
+ 			 * Ascending sort of range lengths for further filling of
+ 			 * histogram
+ 			 */
+ 			qsort(lengths, non_null_cnt, sizeof(float8), float8_qsort_cmp);
+ 
+ 			num_hist = non_null_cnt;
+ 			if (num_hist > num_bins)
+ 				num_hist = num_bins + 1;
+ 
+ 			length_hist_values = (Datum *) palloc(num_hist * sizeof(Datum));
+ 
+ 			/*
+ 			 * The object of this loop is to copy the first and last lengths[]
+ 			 * entries along with evenly-spaced values in between. So the i'th
+ 			 * value is lengths[(i * (nvals - 1)) / (num_hist - 1)]. But
+ 			 * computing that subscript directly risks integer overflow when the
+ 			 * stats target is more than a couple thousand.  Instead we add
+ 			 * (nvals - 1) / (num_hist - 1) to pos at each step, tracking the
+ 			 * integral and fractional parts of the sum separately.
+ 			 */
+ 			delta = (non_null_cnt - 1) / (num_hist - 1);
+ 			deltafrac = (non_null_cnt - 1) % (num_hist - 1);
+ 			pos = posfrac = 0;
+ 
+ 			for (i = 0; i < num_hist; i++)
+ 			{
+ 				length_hist_values[i] = Float8GetDatum(lengths[pos]);
+ 				pos += delta;
+ 				posfrac += deltafrac;
+ 				if (posfrac >= (num_hist - 1))
+ 				{
+ 					/* fractional part exceeds 1, carry to integer part */
+ 					pos++;
+ 					posfrac -= (num_hist - 1);
+ 				}
+ 			}
+ 			stats->staop[slot_idx] = Float8LessOperator;
+ 			stats->stavalues[slot_idx] = length_hist_values;
+ 			stats->numvalues[slot_idx] = num_hist;
+ 			/* We are storing float8 values */
+ 			stats->statypid[slot_idx] = FLOAT8OID;
+ 			stats->statyplen[slot_idx] = sizeof(float8);
+ #ifdef USE_FLOAT8_BYVAL
+ 			stats->statypbyval[slot_idx] = true;
+ #else
+ 			stats->statypbyval[slot_idx] = false;
+ #endif
+ 			stats->statypalign[slot_idx] = 'd';
+ 		}
+ 
  		/* Store the fraction of empty ranges */
  		emptyfrac = (float4 *) palloc(sizeof(float4));
  		*emptyfrac = ((double) empty_cnt) / ((double) non_null_cnt);
! 
  		stats->stanumbers[slot_idx] = emptyfrac;
  		stats->numnumbers[slot_idx] = 1;
  		slot_idx++;
*** a/src/include/catalog/pg_operator.h
--- b/src/include/catalog/pg_operator.h
***************
*** 527,532 **** DATA(insert OID = 671 (  "<>"	   PGNSP PGUID b f f	701  701	16 671 670 float8ne
--- 527,533 ----
  DESCR("not equal");
  DATA(insert OID = 672 (  "<"	   PGNSP PGUID b f f	701  701	16 674 675 float8lt scalarltsel scalarltjoinsel ));
  DESCR("less than");
+ #define Float8LessOperator	672
  DATA(insert OID = 673 (  "<="	   PGNSP PGUID b f f	701  701	16 675 674 float8le scalarltsel scalarltjoinsel ));
  DESCR("less than or equal");
  DATA(insert OID = 674 (  ">"	   PGNSP PGUID b f f	701  701	16 672 673 float8gt scalargtsel scalargtjoinsel ));
*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h
***************
*** 269,279 **** typedef FormData_pg_statistic *Form_pg_statistic;
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * An "empty frac" slot describes the fraction of empty ranges in a range-type
!  * column.  stavalues is not used and should be NULL.  stanumbers contains a
!  * single entry, the fraction of empty ranges (0.0 to 1.0).
   */
! #define STATISTIC_KIND_RANGE_EMPTY_FRAC  6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
--- 269,285 ----
  #define STATISTIC_KIND_DECHIST	5
  
  /*
!  * A "length histogram" slot describes the distribution of range length in rows
!  * of an array-type column. Only non-null rows are considered. We use subdiff
!  * function for measure lengths of non-empty and non-infinite range. That's why
!  * we store lengths in stavalues of float8[] type. The preceding M (>=2) members
!  * form a histogram that divides the population of lengths into M-1 bins of
!  * approximately equal population. The first of these is the minimum observed
!  * length, and the last the maximum. Contain 1 stanumber - fraction of empty
!  * ranges. If typanalyze function finds only one non-NULL range, it fills only
!  * stanumber, while stavalues would be NULL.
   */
! #define STATISTIC_KIND_LENGTH_HISTOGRAM	6
  
  /*
   * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Statistics and selectivity estimation for ranges

Reply via email to