jiacai2050 commented on issue #4809: URL: https://github.com/apache/arrow-datafusion/issues/4809#issuecomment-1377429211
> The timebucket_gap_fill function ([docs](https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/)) can take an optional start and finish arguments which perhaps offers a way to express this case (apply start/finish filters after the query?) I'm afraid this doesn't work, timescale docs says > start and finish arguments do not filter input rows. start/finish works after scan data(using where), it fetched data contains no value of `2022-12-01`, then we won't get it in upper plan node. I did following tests against timescale: ```sql CREATE TABLE stocks_real_time ( time TIMESTAMPTZ NOT NULL, price DOUBLE PRECISION NULL ); SELECT create_hypertable('stocks_real_time','time'); insert into stocks_real_time values ('2022-10-01', 10), ('2022-10-03', 30), ('2022-10-04', 40), ('2022-10-05', 50); SELECT time_bucket_gapfill('1 day', time, timestamp '2022-09-30', timestamp '2022-10-10') AS day, avg(price) AS value, locf(avg(price)), interpolate(avg(price)) FROM stocks_real_time WHERE time > '2022-10-02' AND time < '2022-10-05' GROUP BY day ORDER BY day; ``` It will output ``` | day | value | locf | interpolate | |------------------------+-------+------+-------------| | 2022-09-30 00:00:00+00 | | | | | 2022-10-01 00:00:00+00 | | | | | 2022-10-02 00:00:00+00 | | | | | 2022-10-03 00:00:00+00 | 30 | 30 | 30 | | 2022-10-04 00:00:00+00 | 40 | 40 | 40 | | 2022-10-05 00:00:00+00 | | 40 | | | 2022-10-06 00:00:00+00 | | 40 | | | 2022-10-07 00:00:00+00 | | 40 | | | 2022-10-08 00:00:00+00 | | 40 | | | 2022-10-09 00:00:00+00 | | 40 | | ``` >Another way I could imagine is to run a subquery that has the full range [2022-12-01, 2022-12-04] with timebucket_gap_fill and then apply a filter in an outer query to restrict the data to [2022-12-02, 2022-12-04] Subquery seems unnecessary, if time range in time_bucket_gapfill different with range in where clause, maybe we can overwrite where clause, and filter data in GapFill plan node, something like this(adopted from google docs above): ``` Projection: datebin(...) AS day, locf GapFill: groupBy=[[datebin_gapfill(..)]], aggr=[[locf(avg(price)) as locf]], original_time=(2022-10-02, 2022-10-05) Sort: cpu ASC NULLS LAST, datebin(...) ASC NULLS LAST Aggregate: groupBy=[[datebin(...) AS datebin(...)]], aggr=[] TableScan: cpu projection=[time, price] -- time range is rewritten as (2022-09-30, 2022-10-10) ``` @wolffcm Make sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org