[GitHub] [arrow-datafusion] jiacai2050 commented on issue #4809: Support Gap Filling on Time Series Data

GitBox Tue, 10 Jan 2023 07:17:32 -0800


jiacai2050 commented on issue #4809:
URL: 
https://github.com/apache/arrow-datafusion/issues/4809#issuecomment-1377429211


   > The timebucket_gap_fill function 
([docs](https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/))
 can take an optional start and finish arguments which perhaps offers a way to 
express this case (apply start/finish filters after the query?)
   
   I'm afraid this doesn't work, timescale docs says
   > start and finish arguments do not filter input rows.
   
   start/finish works after scan data(using where), it fetched data contains no 
value of `2022-12-01`, then we won't get it in upper plan node.
   
   I did following tests against timescale:
   
   ```sql
   CREATE TABLE stocks_real_time (
     time TIMESTAMPTZ NOT NULL,
     price DOUBLE PRECISION NULL
   );
   
   SELECT create_hypertable('stocks_real_time','time');
   
   insert into stocks_real_time values
          ('2022-10-01', 10),
          ('2022-10-03', 30),
          ('2022-10-04', 40),
          ('2022-10-05', 50);
   
   SELECT
     time_bucket_gapfill('1 day', time, timestamp '2022-09-30', timestamp 
'2022-10-10') AS day,
     avg(price) AS value,
     locf(avg(price)),
     interpolate(avg(price))
   FROM stocks_real_time
   WHERE time > '2022-10-02' AND time < '2022-10-05'
   GROUP BY day
   ORDER BY day;
   ```
   It will output
   
   ```
   | day                    | value | locf | interpolate |
   |------------------------+-------+------+-------------|
   | 2022-09-30 00:00:00+00 |       |      |             |
   | 2022-10-01 00:00:00+00 |       |      |             |
   | 2022-10-02 00:00:00+00 |       |      |             |
   | 2022-10-03 00:00:00+00 |    30 |   30 |          30 |
   | 2022-10-04 00:00:00+00 |    40 |   40 |          40 |
   | 2022-10-05 00:00:00+00 |       |   40 |             |
   | 2022-10-06 00:00:00+00 |       |   40 |             |
   | 2022-10-07 00:00:00+00 |       |   40 |             |
   | 2022-10-08 00:00:00+00 |       |   40 |             |
   | 2022-10-09 00:00:00+00 |       |   40 |             |
   
   ```
   
   >Another way I could imagine is to run a subquery that has the full range 
[2022-12-01, 2022-12-04] with timebucket_gap_fill and then apply a filter in an 
outer query to restrict the data to [2022-12-02, 2022-12-04]
   
   Subquery seems unnecessary, if time range in time_bucket_gapfill different 
with range in where clause, maybe we can overwrite where clause, and filter 
data in GapFill plan node, something like this(adopted from google docs above):
   
   ```
   Projection: datebin(...) AS day, locf
     GapFill: groupBy=[[datebin_gapfill(..)]], aggr=[[locf(avg(price)) as 
locf]], original_time=(2022-10-02, 2022-10-05)
       Sort: cpu ASC NULLS LAST, datebin(...) ASC NULLS LAST
         Aggregate: groupBy=[[datebin(...) AS datebin(...)]], aggr=[]
           TableScan: cpu projection=[time, price]   -- time range is rewritten 
as (2022-09-30, 2022-10-10)
   ```
   
   @wolffcm Make sense?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] jiacai2050 commented on issue #4809: Support Gap Filling on Time Series Data

Reply via email to