Re: planner support functions: handle GROUP BY estimates ?

Tomas Vondra Tue, 14 Jan 2020 12:54:45 -0800

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:

Justin Pryzby <[email protected]> writes:

On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:

Tom implemented "Planner support functions":
https://www.postgresql.org/docs/12/xfunc-optimization.html
I wondered whether there was any consideration to extend that to allow
providing improved estimates of "group by".  That currently requires manually
by creating an expression index, if the function is IMMUTABLE (which is not
true for eg.  date_trunc of timestamptz).

I didn't hear back so tried implementing this for date_trunc().  Currently, the
...
If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.


While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem.  There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression.  The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions.  (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s).  We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.

I'm not sure how much new infrastructure would have to be built
for this.  We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it.  I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.


Well, I certainly do thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:

  
https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development

The third part of that patch series is exactly about supporting extended
statistics on expressions, about the way you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

The main thing I'm not sure about is how to represent this in catalogs,
whether to have two fields (like for indexes) or maybe a single list of
expressions.


I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,
which means we currently require at least two attributes in CREATE
STATISTICS and so on. So maybe what we want is collecting "regular"
per-column stats just like we do for indexes, but without the index
maintenance overhead?

The advantage would be we'd get exactly the same stats as for indexes,
and we could use them in the same places out of the box. While with
extended stats we'll have to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having

a relation (i.e. table / index / ...) but maybe we could invent a newrelation type for this purpose. Of course, it'd require some catalog

work to represent this ...


Ultimately I think we'd want both things, it's not one or the other.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: planner support functions: handle GROUP BY estimates ?

Reply via email to