On Tue, Mar 13, 2018 at 6:56 AM, Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Mat Arye <m...@timescale.com> writes:
> >> So the use-case is an analytical query like
> >
> >> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
> >> FROM hyper
> >> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
> >> GROUP BY MetricMinuteTs
> >> ORDER BY MetricMinuteTs DESC;
> >
> >> Right now this query will choose a much-less-efficient GroupAggregate plan
> >> instead of a HashAggregate. It will choose this because it thinks the
> >> number of groups
> >> produced here is 9,000,000 because that's the number of distinct time
> >> values there are.
> >> But, because date_trunc "buckets" the values there will be about 24 groups
> >> (1 for each hour).
> >
> > While it would certainly be nice to have better behavior for that,
> > "add a hook so users who can write C can fix it by hand" doesn't seem
> > like a great solution. On top of the sheer difficulty of writing a
> > hook function, you'd have the problem that no pre-written hook could
> > know about all available functions. I think somehow we'd need a way
> > to add per-function knowledge, perhaps roughly like the protransform
> > feature.
>
> Like cost associated with a function, we may associate mapping
> cardinality with a function. It tells how many distinct input values
> map to 1 output value. By input value, I mean input argument tuple. In
> Mat's case the mapping cardinality will be 12. The number of distinct
> values that function may output is estimated as number of estimated
> rows / mapping cardinality of that function.
>

I think this is complicated by the fact that the mapping cardinality is not
a constant per function. It depends on the constant given as the first
argument to the function and on the granularity of the underlying data (for
example, whether the timestamps have second or microsecond granularity).

I actually think the estimate here should be
(max(time) - min(time)) / interval. To be general, I think you need to allow
functions that operate on the column statistics to determine the estimate.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>
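
To make the (max(time) - min(time)) / interval idea concrete, here is a minimal Python sketch of that arithmetic. It is only an illustration under stated assumptions, not planner code: the function name estimate_groups and its parameters are hypothetical, and it assumes the min/max of the column (or of the WHERE range) and the number of distinct values are already available from statistics.

```python
import math
from datetime import datetime, timedelta


def estimate_groups(min_time, max_time, bucket, n_distinct):
    """Estimate how many groups a date_trunc-style bucketing produces.

    Hypothetical helper: applies (max - min) / interval, rounded up,
    and clamps the result to the range [1, n_distinct].
    """
    if bucket <= timedelta(0):
        raise ValueError("bucket width must be positive")
    if max_time < min_time:
        raise ValueError("max_time must not precede min_time")

    # Dividing two timedeltas yields a float ratio of span to bucket width.
    ratio = (max_time - min_time) / bucket

    # At least one group exists if there is any data at all, and there can
    # never be more groups than distinct input values.
    return max(1, min(math.ceil(ratio), n_distinct))


# The range from the quoted query, bucketed by hour, with the
# 9,000,000 distinct time values mentioned in the thread.
groups = estimate_groups(
    datetime(2001, 1, 4, 0, 0, 0),
    datetime(2001, 1, 5, 1, 0, 0),
    timedelta(hours=1),
    9_000_000,
)
```

For the quoted query this yields 25, which is in line with the "about 24 groups" figure and far from 9,000,000. Note that the result depends only on the time span and the bucket width, not on the data granularity, which is why a fixed per-function mapping cardinality does not capture it.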