Justin Pryzby <pry...@telsasoft.com> writes:
> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>> Tom implemented "Planner support functions":
>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>> I wondered whether there was any consideration to extend that to allow
>>> providing improved estimates of "group by".  That currently requires 
>>> manually
>>> by creating an expression index, if the function is IMMUTABLE (which is not
>>> true for eg.  date_trunc of timestamptz).

>> I didn't hear back so tried implementing this for date_trunc().  Currently, 
>> the
>> ...
>> If the input timestamps have (say) hourly granularity, rowcount will be
>> *underestimated* by 3600x, which is worse than the behavior in master of
>> overestimating by (for "day") 24x.

While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem.  There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression.  The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions.  (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s).  We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.

I'm not sure how much new infrastructure would have to be built
for this.  We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it.  I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.

                        regards, tom lane


Reply via email to