Re: planner support functions: handle GROUP BY estimates ?

Justin Pryzby Sun, 22 Dec 2019 16:17:39 -0800

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
> 
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by".  That currently requires manually
> by creating an expression index, if the function is IMMUTABLE (which is not
> true for eg.  date_trunc of timestamptz).


I didn't hear back so tried implementing this for date_trunc().  Currently, the
planner assumes that functions output equally many groups as their input
variables.  Most invocations of our reports use date_trunc (or similar), so my
earlier attempt to alert on rowcount misestimates was very brief.

I currently assume that the input data has 1 second granularity:
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM 
generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group  (cost=9021.85..9042.13 rows=169 width=8) (actual 
time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate  (cost=10172.79..10298.81 rows=10081 width=8) (actual 
time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group  (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 
rows=8 loops=1)

If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes that makes the
rowcount less than (say) 2 or 100.  That would divide 24 but might then avoid
the last /60 or /60/60.  Ultimately, that's more "fudge" than anything else;

2) Leave alone pg_catalog.date_trunc(), but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc, which include the
given corrective factor, which should allow more accurate rowcount for input
data with granularity of the given number of seconds.

Ideally, that would be user-specified factor, but I don't think that's possible
to specify in SQL; the constant has to be built into the C function.  At
telsasoft, our data mostly has 15minute granularity (900sec), so we'd maybe
make a "date_trunc" function in the user schema which calls the
pg_catalog.date_trunc with support function timestamp_support_10pow3;

There could be a "base" support function that accepts a multiplier argument,
and then any user-provided C extension would be a one-liner specifing an
arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;

Justin

>From 94f7791a7f82dea8757ef139befbf26feb970685 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <[email protected]>
Date: Sun, 15 Dec 2019 20:27:24 -0600
Subject: [PATCH v1] Planner support functions for GROUP BY f()..

..implemented for date_trunc()

See also a391ff3c3d418e404a2c6e4ff0865a107752827b
---
 src/backend/optimizer/util/plancat.c | 44 ++++++++++++++++
 src/backend/utils/adt/selfuncs.c     | 28 +++++++++++
 src/backend/utils/adt/timestamp.c    | 97 ++++++++++++++++++++++++++++++++++++
 src/include/catalog/catversion.h     |  2 +-
 src/include/catalog/pg_proc.dat      | 15 ++++--
 src/include/nodes/nodes.h            |  3 +-
 src/include/nodes/supportnodes.h     | 15 ++++++
 src/include/optimizer/plancat.h      |  2 +
 8 files changed, 201 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5e889d1..9438b3c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2008,6 +2008,50 @@ get_function_rows(PlannerInfo *root, Oid funcid, Node *node)
 	return result;
 }
 
+/* */
+double
+get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var)
+{
+	HeapTuple	proctup;
+	Form_pg_proc procform;
+
+	proctup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
+	if (!HeapTupleIsValid(proctup))
+		elog(ERROR, "cache lookup failed for function %u", funcid);
+	procform = (Form_pg_proc) GETSTRUCT(proctup);
+
+	if (OidIsValid(procform->prosupport))
+	{
+		SupportRequestGroupBy req;
+		SupportRequestGroupBy *sresult;
+
+		req.type = T_SupportRequestGroupBy;
+		req.root = root;
+		req.funcid = funcid;
+		req.node = node;
+		req.var = var;
+		req.factor = 1;			/* just for sanity */
+
+		sresult = (SupportRequestGroupBy *)
+			DatumGetPointer(OidFunctionCall1(procform->prosupport,
+											 PointerGetDatum(&req)));
+
+		if (sresult == &req)
+		{
+			/* Success */
+			ReleaseSysCache(proctup);
+			return req.factor;
+		}
+	}
+
+	/* XXX No support function, or it failed */
+
+	ReleaseSysCache(proctup);
+
+	return 1;
+}
+
+
 /*
  * has_unique_index
  *
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ff02b5a..eb0b86f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3154,10 +3154,38 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		 */
 		foreach(l2, varshere)
 		{
+			double ret;
 			Node	   *var = (Node *) lfirst(l2);
 
 			examine_variable(root, var, 0, &vardata);
 			varinfos = add_unique_group_var(root, varinfos, var, &vardata);
+
+			/* If we group by a function of a simple var, try to call its support function to help estimate GROUP BY */
+			if (HeapTupleIsValid(vardata.statsTuple) &&
+					IsA(groupexpr, FuncExpr) && IsA(var, Var))
+					// && (varRelid == 0 || varRelid == ((Var *) basenode)->varno))
+			{
+				Form_pg_statistic stats = (Form_pg_statistic)GETSTRUCT(vardata.statsTuple);
+				FuncExpr   *expr = (FuncExpr *) groupexpr;
+
+				Var *v = (Var*) var;
+				RangeTblEntry *rte = root->simple_rte_array[v->varno];
+				char *reln = get_rel_name(rte->relid);
+				char *coln = get_attname(rte->relid, v->varattno, true);
+				ret = get_function_groupby(root, expr->funcid, groupexpr, var);
+
+				fprintf(stderr, "HERE %s %d %s.%s ndistinct=%f ret=%f\n", __FUNCTION__, __LINE__,
+						reln?reln:"null",
+						coln?coln:"null",
+						stats->stadistinct, ret);
+
+				Assert(ret>=0);
+				Assert(ret<=1);
+				numdistinct *= ret;
+
+				/* Can we do anything special with stats->stadistinct that's not already done in general? */
+			}
+
 			ReleaseVariableStats(vardata);
 		}
 	}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 945b8f8..359ad32 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5554,3 +5554,100 @@ generate_series_timestamptz(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 	}
 }
+
+
+/*
+ * Planner support function for date_trunc
+ * Try to estimate the factor by which to correct the estimate of ngroups for GROUP BY.
+ */
+Datum
+date_trunc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq;
+	SupportRequestGroupBy *req;
+	List	   *args;
+	Node	   *arg1, *arg2;
+	Node		 	*var;
+	char		*start;
+	int			typmod;
+
+	rawreq = (Node *) PG_GETARG_POINTER(0);
+	if (!IsA(rawreq, SupportRequestGroupBy))
+		PG_RETURN_POINTER(NULL);
+
+	req = (SupportRequestGroupBy *) rawreq;
+	if (!is_funcclause(req->node))	/* be paranoid */
+		PG_RETURN_POINTER(NULL);
+
+	args = ((FuncExpr *) req->node)->args;
+	arg1 = linitial(args);
+	arg2 = lsecond(args);
+	/* arg3 may be the timezone */
+
+	var = req->var;
+
+	/* XXX Assumes the input has 1-second granularity */
+
+	// XXX: only work on const?
+	start = TextDatumGetCString(((Const*)arg1)->constvalue);
+
+	// XXX: not working due to promotion ?
+	// exprType(var) is still not right, since date_trunc(a, b::date) uses b's type and not date..
+	// ...but date_trunc('', x::date) is weird
+	// exprType(arg2)==TIMESTAMPOID || exprType(arg2)==TIMESTAMPTZOID)
+	// if (req->funcid==1217 || req->funcid==1284 || req->funcid==2020)
+	/* Reset if it's not a timestamp */
+	if (exprType(var)==DATEOID) {
+		req->factor = 60*60*24;
+	} else if (exprType(var)==TIMESTAMPOID || exprType(var)==TIMESTAMPTZOID) {
+		int typmod = exprTypmod(arg2); // XXX: vartypmod ?
+		/* If the input has decimal digits, the grouping effect is stronger */
+		if (typmod != -1) {
+			req->factor /= 2<<typmod;
+			if (strcmp(start, "microseconds")==0) {
+				/* do nothing? */
+			} else if (strcmp(start, "milliseconds")==0) {
+				/* do nothing? */
+			}
+		}
+
+		if (strcmp(start, "second")==0) {
+			/* do nothing */
+		} else if (strcmp(start, "minute")==0) {
+			req->factor /= 60;
+		} else if (strcmp(start, "hour")==0) {
+			req->factor /= 60*60;
+		}
+	}
+
+	// else { elog(ERROR, "unknown type %u", exprType(var)); }
+
+	if (strcmp(start, "day")==0) {
+		req->factor /= 60*60*24;
+	} else if (strcmp(start, "week")==0) {
+		req->factor /= 60*60*24*7;
+	} else if (strcmp(start, "month")==0) {
+		/* 30 days */
+		req->factor /= 60*60*24*30;
+	} else if (strcmp(start, "quarter")==0) {
+		req->factor /= 60*60*24*30*3;
+	} else if (strcmp(start, "year")==0) {
+		req->factor /= 60*60*24*365.24;
+	} else if (strcmp(start, "decade")==0) {
+		req->factor /= 60*60*24*365.25*10;
+	} else if (strcmp(start, "century")==0) {
+		req->factor /= 60*60*24*365.25*100;
+	} else if (strcmp(start, "millennium")==0) {
+		req->factor /= 60*60*24*365.25*1000;
+	} else if (req->factor > 1) {
+		/* Maybe a DATE with finer graularity trunc */
+		req->factor = 1;
+	}
+	// else { elog(ERROR, "", ); }
+
+	/* Fudge Factor, since the input may be already "grouped", say at multiples of 15min, */
+	/* or otherwise have course granularity to begin with */
+	// factor/=4;
+
+	PG_RETURN_POINTER(req);
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eca67a1..e2c05be 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201911242
+#define CATALOG_VERSION_NO	201911243
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index aae50d6..7a8c3d1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -2350,11 +2350,14 @@
 { oid => '1217',
   descr => 'truncate timestamp with time zone to specified units',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc' },
+  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc',
+  prosupport => 'date_trunc_support', },
 { oid => '1284',
   descr => 'truncate timestamp with time zone to specified units in specified time zone',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone' },
+  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone',
+  prosupport => 'date_trunc_support', },
+# XXX:
 { oid => '1218', descr => 'truncate interval to specified units',
   proname => 'date_trunc', prorettype => 'interval',
   proargtypes => 'text interval', prosrc => 'interval_trunc' },
@@ -5632,7 +5635,13 @@
   proargtypes => 'timestamptz', prosrc => 'timestamptz_time' },
 { oid => '2020', descr => 'truncate timestamp to specified units',
   proname => 'date_trunc', prorettype => 'timestamp',
-  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc' },
+  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc',
+  prosupport => 'date_trunc_support', },
+
+{ oid => '5449', descr => 'planner support for date_trunc',
+  proname => 'date_trunc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'date_trunc_support', },
+
 { oid => '2021', descr => 'extract field from timestamp',
   proname => 'date_part', prorettype => 'float8',
   proargtypes => 'text timestamp', prosrc => 'timestamp_part' },
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59..499fed1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -513,7 +513,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition,	/* in nodes/supportnodes.h */
+	T_SupportRequestGroupBy,	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 460d75b..83f14a3 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -168,6 +168,21 @@ typedef struct SupportRequestRows
 	double		rows;			/* number of rows expected to be returned */
 } SupportRequestRows;
 
+/* How many fewer rows are output after GROUPing BY a function */
+typedef struct SupportRequestGroupBy
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	struct PlannerInfo *root;	/* Planner's infrastructure (could be NULL) */
+	Oid			funcid;			/* function we are inquiring about */
+	Node			*var;			/* original (2nd) argument */
+	Node	   *node;			/* parse node invoking function */
+
+	/* Output fields: */
+	double		factor;			/* fraction of rows: 0..1 */
+} SupportRequestGroupBy;
+
 /*
  * The IndexCondition request allows the support function to generate
  * a directly-indexable condition based on a target function call that is
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index bbb27f8..cbe5386 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -70,6 +70,8 @@ extern void add_function_cost(PlannerInfo *root, Oid funcid, Node *node,
 
 extern double get_function_rows(PlannerInfo *root, Oid funcid, Node *node);
 
+extern double get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var);
+
 extern bool has_row_triggers(PlannerInfo *root, Index rti, CmdType event);
 
 extern bool has_stored_generated_columns(PlannerInfo *root, Index rti);
-- 
2.7.4

Re: planner support functions: handle GROUP BY estimates ?

Reply via email to