[GitHub] [arrow-datafusion] wolffcm opened a new issue, #4809: Support Gap Filling on Time Series Data

GitBox Tue, 03 Jan 2023 14:02:30 -0800


wolffcm opened a new issue, #4809:
URL: https://github.com/apache/arrow-datafusion/issues/4809


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   A common use case when working with time series data is to compute an 
aggregate value for windows of time, e.g., every minute, hour, week or 
whatever. It is possible to do this with the `DATE_BIN` function in DataFusion. 
However, `DATE_BIN` will not produce any value for a window that did not 
contain any rows.
   
   For example, for this input date:
   
   | time | c0 |
   | --- | --- |
   | 2022-12-01 | 10 |
   | 2022-12-03 | 30 |
   
   We might run this query;
   ```sql
   select
     date_bin(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
     avg(c0) 
   from t
   group by day;
   ```
   And we would get something like:
   
   | day | avg |
   | --- | --- |
   | 2022-12-01 | 10 |
   | 2022-12-03 | 30 |
   
   Generating a row in the output for `2022-12-02` is difficult to do with 
ANSI-SQL.  Here is one attempt: [Fill Gaps in Time Series with this simple 
trick in 
SQL](https://medium.com/codex/fill-gaps-in-time-series-with-this-simple-trick-in-sql-81ac655e5ad7).
 Having to write SQL like this for what is an intuitive and common use case is 
frustrating.
   
   **Describe the solution you'd like**
   
   It would be good to have a concise, idiomatic way to do this. Many vendors 
provide a solution for this problem. The have the following in common:
   - They provide a way to break up an interval of time into contiguous windows
   - They provide some kind of way to produce a value where there were no input 
rows
   
   One such solution would be to use a function like TimeScale's functions 
`time_bucket_gapfill` and `locf` (last observation carried forward):
   
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/
   
   The above query might be changed to this, using `time_bucket_gapfill` and 
`locf`:
   ```sql
   select
     time_bucket_gapfill(interval '1 day', time, timestamp 
'1970-01-01T00:00:00Z') as day,
     avg(c0),
     locf(avg(c0))
   from t
   group by day;
   ```
   
   | day | avg | locf |
   | --- | --- | --- |
   | 2022-12-01 | 10 | 10 |
   | 2022-12-02 |  | 10 |   
   | 2022-12-03 | 30 | 30 |
   
   TimeScale also provides `interpolate` to populate a gap with an interpolated 
value (e.g., would put `20` in the gap for the example).
   
   **Describe alternatives you've considered**
   
   Postgres provides a general purpose way to generate data:
   https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
   But this seems like it would be more difficult to use than something like 
`time_bucket_gapfill`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] wolffcm opened a new issue, #4809: Support Gap Filling on Time Series Data

Reply via email to