Hi,

On 05/13/15 23:07, Jeff Janes wrote:
With the warning it is very hard to correlate the discrepancy you do
see with which column is causing it, as the warnings don't include
table or column names (Assuming of course that you run it on a
substantial database--if you just run it on a few toy cases then the
warning works well).

That's true. I've added attnum/attname to the warning in the attached version of the patch.
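For illustration, with the column info included the warning now looks roughly like this (the attnum/attname and the numbers here are made up, not from a real run):

    WARNING:  ndistinct estimate attnum=2 attname=user_id current=31871.00 adaptive=104232.00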

If we want to have an explicitly experimental patch which we want
people with interesting real-world databases to report back on, what
kind of patch would it have to be to encourage that to happen? Or are
we never going to get such feedback no matter how friendly we make
it? Another problem is that you really need to have the gold standard
to compare them to, and getting that is expensive (which is why we
resort to sampling in the first place). I don't think there is much
to be done on that front other than bite the bullet and just do
it--perhaps only for the tables which have discrepancies.

Not sure. The "experimental" part of the patch was not really aimed at users outside the development community - it was meant to be used by members of the community, possibly testing it on customer databases. I don't think adding the GUC into the final release is a good idea - it's just noise in the config that no one would actually use.

Some of the regressions I've seen are at least partly a bug:

+   /* find the 'm' value minimizing the difference */
+   for (m = 1; m <= total_rows; m += step)
+   {
+       double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows
vigorously. A simple cast to double before the multiplication fixes
the first example I produced. The estimate goes from 137,177 to
1,108,076. The reality is 1,062,223.

Perhaps m should just be declared a double, as it is frequently
used in double arithmetic.

Yeah, I just discovered this bug independently. There's another bug: adaptive_estimator takes total_rows as an integer, so it breaks for tables with more than INT_MAX rows. Both are fixed in v5.
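
The overflow is easy to reproduce outside the patch, too. A minimal standalone sketch (not the actual analyze.c code; the values are made up just to show the failure mode):

    #include <stdio.h>

    int
    main(void)
    {
        int     sample_rows = 30000;    /* sample size for statistics_target=100 */
        int     m = 1000000;            /* a candidate 'm' from the search loop */
        double  k = 50000.0;            /* stands in for f[1] + 2*f[2] */

        /*
         * int * int: 30000 * 1000000 does not fit into a 32-bit int, so the
         * product wraps (strictly speaking, signed overflow is undefined
         * behavior) and q comes out as garbage.
         */
        double  q_broken = k / (sample_rows * m);

        /* cast before multiplying (or declare m as double, as v5 does) */
        double  q_fixed = k / ((double) sample_rows * m);

        printf("broken: %g  fixed: %g\n", q_broken, q_fixed);
        return 0;
    }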


        Therefore, I think that:

        1. This should be committed near the beginning of a release cycle,
        not near the end. That way, if there are problem cases, we'll have a
        year or so of developer testing to shake them out.


It can't hurt, but how effective will it be? Will developers know or
care whether ndistinct happened to get better or worse while they
are working on other things? I would think that problems will be
found by focused testing, or during beta, and probably not by
accidental discovery during the development cycle. It can't hurt, but
I don't know how much it will help.

I agree with that - it's unlikely the regressions will get discovered randomly. OTOH I'd expect a non-trivial number of people on this list to have a few examples of ndistinct failures, and testing those would be more useful, I guess. But that's unlikely to find the cases that worked OK before and got broken by the new estimator :-(

I agree with the "experimental GUC".  That way if hackers do happen to
see something suspicious, they can just turn it off and see what
difference it makes.  If they have to reverse out a patch from 6 months
ago in an area of the code they aren't particularly interested in and
then recompile their code and then juggle two different sets of
binaries, they will likely just shrug it off without investigation.

+1


--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..f030702 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -16,6 +16,7 @@
 
 #include <math.h>
 
+#include "access/hash.h"
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/tupconvert.h"
@@ -97,6 +98,9 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double adaptive_estimator(double total_rows, int sample_rows,
+								int *f, int f_max);
+static int hash_comparator(const void *a, const void *b);
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -1698,6 +1702,7 @@ static void compute_scalar_stats(VacAttrStatsP stats,
 					 int samplerows,
 					 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
+static int	compare_scalars_simple(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
 
 
@@ -1816,6 +1821,23 @@ compute_minimal_stats(VacAttrStatsP stats,
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
 
 	/*
+	 * The adaptive ndistinct estimator requires the complete set of
+	 * repetition counts - we can't use the sort-based counting directly
+	 * (this code path handles data types with only an '=' operator),
+	 * and the MCV-based counting seems insufficient. We'll instead
+	 * compute hash values and sort those. We're using just 32-bit
+	 * hashes, which may result in a few collisions - for 30k rows
+	 * (sampled rows for default_statistics_target=100) there's a 1:10
+	 * chance of a hash collision (assuming all values are distinct).
+	 * But this seems like a small error compared to the other factors
+	 * involved (sampling, ...) or compared to the MCV-based counting.
+	 */
+	uint32	   *hashes = (uint32*)palloc0(samplerows * sizeof(uint32));
+
+	/* number of computed hashes (technically equal to nonnull_cnt) */
+	int			nhashes = 0;
+
+	/*
 	 * We track up to 2*n values for an n-element MCV list; but at least 10
 	 */
 	track_max = 2 * num_mcv;
@@ -1876,6 +1898,36 @@ compute_minimal_stats(VacAttrStatsP stats,
 			total_width += strlen(DatumGetCString(value)) + 1;
 		}
 
+		/* compute the hash value, depending on the data type kind */
+		if (stats->attrtype->typbyval)
+		{
+			/* simple pass-by-value data type, with 'typlen' bytes */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) &value,
+										  stats->attrtype->typlen));
+		}
+		else if (is_varlena)
+		{
+			/* regular varlena data type */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) VARDATA_ANY(value),
+										  VARSIZE_ANY_EXHDR(DatumGetPointer(value))));
+		}
+		else if (is_varwidth)
+		{
+			/* pass-by-reference with a variable length (e.g. cstring) */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) DatumGetCString(value),
+										  strlen(DatumGetCString(value))));
+		}
+		else
+		{
+			/* pass-by-reference with fixed length (e.g. name) */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) DatumGetCString(value),
+										  stats->attrtype->typlen));
+		}
+
 		/*
 		 * See if the value matches anything we're already tracking.
 		 */
@@ -1931,6 +1983,43 @@ compute_minimal_stats(VacAttrStatsP stats,
 		int			nmultiple,
 					summultiple;
 
+		/* values needed by the adaptive ndistinct estimator */
+		int			f_max = 0;
+		int		   *f_count = (int*)palloc0(sizeof(int) * (nhashes + 1));
+		int			prev_index;
+
+		/* sort the hashes and then count the repetitions */
+		qsort(hashes, nhashes, sizeof(uint32), hash_comparator);
+
+		/*
+		 * Counting repetitions - walk through the sorted array, compare
+		 * the value to the previous one, and whenever it changes we
+		 * compute the repetition count using the array indexes.
+		 */
+		prev_index = 0;
+		for (i = 1; i < nhashes; i++)
+		{
+			/* the hashes are different - store the repetition count */
+			if (hashes[i] != hashes[i-1])
+			{
+				f_count[i - prev_index] += 1;
+
+				if (f_max < (i - prev_index))
+					f_max = (i - prev_index);
+
+				prev_index = i;
+			}
+		}
+
+		/* the last element is not updated in the loop */
+		f_count[nhashes - prev_index] += 1;
+
+		if (f_max < (nhashes - prev_index))
+			f_max = (nhashes - prev_index);
+
+		/* wide values are assumed to be distinct */
+		f_count[1] += toowide_cnt;
+
 		stats->stats_valid = true;
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
@@ -1988,6 +2077,7 @@ compute_minimal_stats(VacAttrStatsP stats,
 			double		numer,
 						denom,
 						stadistinct;
+			double		adaptdistinct;
 
 			numer = (double) samplerows *(double) d;
 
@@ -2001,6 +2091,31 @@ compute_minimal_stats(VacAttrStatsP stats,
 			if (stadistinct > totalrows)
 				stadistinct = totalrows;
 			stats->stadistinct = floor(stadistinct + 0.5);
+
+			/*
+			 * When computing the adaptive estimate, we're only considering
+			 * non-null values, so we need to perform correction of the
+			 * total rows / sample rows to reflect this. Otherwise the
+			 * coefficients (f_count / f_max) are out of sync. We could
+			 * probably do the inverse thing (including NULL values into
+			 * f_count) with the same effect.
+			 */
+			adaptdistinct
+				= adaptive_estimator(totalrows * (nonnull_cnt / (double)samplerows),
+									nonnull_cnt, f_count, f_max);
+
+			elog(WARNING, "ndistinct estimate attnum=%d attname=%s current=%.2f adaptive=%.2f",
+						  stats->attr->attnum, NameStr(stats->attr->attname),
+						  stadistinct, adaptdistinct);
+
+			/* if we've seen 'almost' all rows, use the estimate instead */
+			if (samplerows >= 0.95 * totalrows)
+			{
+				adaptdistinct = (d + d/0.95)/2;
+				elog(WARNING, "  corrected ndistinct estimate current=%.2f adaptive=%.2f",
+					 stadistinct, adaptdistinct);
+			}
+
 		}
 
 		/*
@@ -2225,6 +2340,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		int			slot_idx = 0;
 		CompareScalarsContext cxt;
 
+		/* f values for the estimator - messy and we likely need much
+		 * less memory, but who cares */
+		int			f_max = 0; /* max number of duplicates */
+		int		   *f_count = (int*)palloc0(sizeof(int)*(values_cnt+1));
+		int			first_index = 0;	/* first index of a group */
+
 		/* Sort the collected values */
 		cxt.ssup = &ssup;
 		cxt.tupnoLink = tupnoLink;
@@ -2295,6 +2416,35 @@ compute_scalar_stats(VacAttrStatsP stats,
 			}
 		}
 
+		/*
+		 * Counting repetitions - walk through the sorted array, compare
+		 * the value to the previous one, and whenever it changes we
+		 * compute the repetition count using the array indexes.
+		 */
+		for (i = 1; i < values_cnt; i++)
+		{
+			/* the values are different - store the repetition count */
+			if (compare_scalars_simple(&values[i], &values[first_index], &cxt) != 0)
+			{
+				/* found first element of the following group, so (i-first) is the count */
+				f_count[i - first_index] += 1;
+
+				if (f_max < (i - first_index))
+					f_max = (i - first_index);
+
+				first_index = i;
+			}
+		}
+
+		/* the last element is not updated in the loop */
+		f_count[values_cnt - first_index] += 1;
+
+		if (f_max < (values_cnt - first_index))
+			f_max = (values_cnt - first_index);
+
+		/* compensate for wide values (assumed to be distinct) */
+		f_count[1] += toowide_cnt;
+
 		stats->stats_valid = true;
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
@@ -2337,6 +2487,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 			double		numer,
 						denom,
 						stadistinct;
+			double		adaptdistinct;	/* adaptive estimate */
 
 			numer = (double) samplerows *(double) d;
 
@@ -2350,6 +2501,30 @@ compute_scalar_stats(VacAttrStatsP stats,
 			if (stadistinct > totalrows)
 				stadistinct = totalrows;
 			stats->stadistinct = floor(stadistinct + 0.5);
+
+			/*
+			 * When computing the adaptive estimate, we're only considering
+			 * non-null values, so we need to perform correction of the
+			 * total rows / sample rows to reflect this. Otherwise the
+			 * coefficients (f_count / f_max) are out of sync. We could
+			 * probably do the inverse thing (including NULL values into
+			 * f_count) with the same effect.
+			 */
+			adaptdistinct
+				= adaptive_estimator(totalrows * (nonnull_cnt / (double)samplerows),
+									nonnull_cnt, f_count, f_max);
+
+			elog(WARNING, "ndistinct estimate attnum=%d attname=%s current=%.2f adaptive=%.2f",
+						  stats->attr->attnum, NameStr(stats->attr->attname),
+						  stadistinct, adaptdistinct);
+
+			/* if we've seen 'almost' all rows, use the estimate instead */
+			if (samplerows >= 0.95 * totalrows)
+			{
+				adaptdistinct = (d + d/0.95)/2;
+				elog(WARNING, "  corrected ndistinct estimate current=%.2f adaptive=%.2f",
+					 stadistinct, adaptdistinct);
+			}
 		}
 
 		/*
@@ -2665,6 +2840,21 @@ compare_scalars(const void *a, const void *b, void *arg)
 }
 
 /*
+ * qsort_arg comparator for sorting ScalarItems - unlike compare_scalars,
+ * this one compares only the values (no tupno tie-breaking).
+ */
+static int
+compare_scalars_simple(const void *a, const void *b, void *arg)
+{
+	Datum		da = ((const ScalarItem *) a)->value;
+	Datum		db = ((const ScalarItem *) b)->value;
+
+	CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
+
+	return ApplySortComparator(da, false, db, false, cxt->ssup);
+}
+
+/*
  * qsort comparator for sorting ScalarMCVItems by position
  */
 static int
@@ -2675,3 +2865,131 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
+/*
+ * Implements the AEE ndistinct estimator, described in the paper "Towards
+ * Estimation Error Guarantees for Distinct Values" (ACM 2000), with a
+ * minor tweak for highly skewed distributions, where the AEE is quite
+ * unstable. In those cases (when either f1 or f2 is 0) we simply fall
+ * back to the GEE estimator, which seems to work better there.
+ *
+ * The AEE estimator is based on solving this equality (for "m")
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
+ *
+ * Instead of looking for the exact solution to this equation (which
+ * might be fractional), we'll look for a natural number minimizing
+ * the absolute difference. The number of distinct elements is a natural
+ * number, and we don't mind if the number is slightly wrong. It's
+ * just an estimate, after all. The error from sampling will be much
+ * worse in most cases.
+ *
+ * We know the acceptable values of 'm' are [d,N] where 'd' is the number
+ * of distinct elements in the sample, and N is the number of rows in
+ * the table (not just the sample). For large tables (billions of rows)
+ * that'd be quite time-consuming to compute, so we'll approximate the
+ * solution by gradually increasing the step to ~1% of the current value
+ * of 'm'. This will make it much faster and yet very accurate.
+ *
+ * All this of course assumes the function behaves reasonably (not
+ * oscillating etc.), but that's a safe assumption as the estimator
+ * would perform terribly otherwise (and we're falling back to GEE for
+ * cases where this behavior could happen).
+ */
+static double
+adaptive_estimator(double total_rows, int sample_rows, int *f, int f_max)
+{
+	int		i;
+	double	m;
+	double	A = 0.0, B = 0.0;
+	double	opt_m = 0;
+	double	opt_diff = total_rows;
+	double	step = 1.0;
+	double	ndistinct;
+	int		d = f[1] + f[2];
+
+	double k = (f[1] + 2*f[2]);
+
+	/* for highly-skewed distributions, the GEE works better */
+	if (f[1] == 0 || f[2] == 0)
+	{
+		d = f[1];
+		ndistinct = sqrt(total_rows / (double)sample_rows) * f[1];
+
+		for (i = 2; i <= f_max; i++)
+		{
+			ndistinct += f[i];
+			d += f[i];
+		}
+
+		/* sanity checks that the estimate is within [d,total_rows] */
+		if (ndistinct < d)
+			ndistinct = d;
+		else if (ndistinct > total_rows / sample_rows * d)
+			ndistinct = total_rows / sample_rows * d;
+
+		return ndistinct;
+	}
+
+	/* compute the 'constant' parts of the equality (A, B) */
+	for (i = 3; i <= f_max; i++)
+	{
+		double p = i / (double)sample_rows;
+		A +=     powl((1.0 - p), sample_rows  ) * f[i];
+		B += i * powl((1.0 - p), sample_rows-1) * f[i];
+		d += f[i];
+	}
+
+	/* find the 'm' value minimizing the difference */
+	for (m = 1; m <= total_rows; m += step)
+	{
+		double q = k / (sample_rows * m);
+
+		double A_m = A + m * powl((1 - q), sample_rows  );
+		double B_m = B + k * powl((1 - q), sample_rows-1);
+
+		double diff = fabsl(f[1] * (A_m / B_m) - (m - f[1] - f[2]));
+
+		/*
+		 * The function should have a single minimum - stop if we've passed
+		 * it, but not on the first element.
+		 */
+		if ((m > 1) && (opt_diff < diff))
+			break;
+
+		opt_diff = diff;
+		opt_m = m;
+
+		/* tweak the step to 1% to make it faster */
+		step = ((int)(0.01 * m) > step) ? (int)(0.01 * m) : step;
+	}
+
+	/* compute the final estimate */
+	ndistinct = d + opt_m - f[1] - f[2];
+
+	/* sanity checks that the estimate is within [d,total_rows] */
+	if (ndistinct < d)
+		ndistinct = d;
+	else if (ndistinct > total_rows / sample_rows * d)
+		ndistinct = total_rows / sample_rows * d;
+
+	return ndistinct;
+}
+
+static int
+hash_comparator(const void *a, const void *b)
+{
+	uint32 ai = *(uint32*)a;
+	uint32 bi = *(uint32*)b;
+	if (ai < bi)
+		return -1;
+	else if (ai > bi)
+		return 1;
+	else
+		return 0;
+}