wolffcm opened a new issue, #4809:
URL: https://github.com/apache/arrow-datafusion/issues/4809
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
A common use case when working with time series data is to compute an
aggregate value for windows of time, e.g., every minute, hour, week or
whatever. It is possible to do this with the `DATE_BIN` function in DataFusion.
However, `DATE_BIN` will not produce any value for a window that did not
contain any rows.
For example, for this input date:
| time | c0 |
| --- | --- |
| 2022-12-01 | 10 |
| 2022-12-03 | 30 |
We might run this query;
```sql
select
date_bin(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
avg(c0)
from t
group by day;
```
And we would get something like:
| day | avg |
| --- | --- |
| 2022-12-01 | 10 |
| 2022-12-03 | 30 |
Generating a row in the output for `2022-12-02` is difficult to do with
ANSI-SQL. Here is one attempt: [Fill Gaps in Time Series with this simple
trick in
SQL](https://medium.com/codex/fill-gaps-in-time-series-with-this-simple-trick-in-sql-81ac655e5ad7).
Having to write SQL like this for what is an intuitive and common use case is
frustrating.
**Describe the solution you'd like**
It would be good to have a concise, idiomatic way to do this. Many vendors
provide a solution for this problem. The have the following in common:
- They provide a way to break up an interval of time into contiguous windows
- They provide some kind of way to produce a value where there were no input
rows
One such solution would be to use a function like TimeScale's functions
`time_bucket_gapfill` and `locf` (last observation carried forward):
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/
The above query might be changed to this, using `time_bucket_gapfill` and
`locf`:
```sql
select
time_bucket_gapfill(interval '1 day', time, timestamp
'1970-01-01T00:00:00Z') as day,
avg(c0),
locf(avg(c0))
from t
group by day;
```
| day | avg | locf |
| --- | --- | --- |
| 2022-12-01 | 10 | 10 |
| 2022-12-02 | | 10 |
| 2022-12-03 | 30 | 30 |
TimeScale also provides `interpolate` to populate a gap with an interpolated
value (e.g., would put `20` in the gap for the example).
**Describe alternatives you've considered**
Postgres provides a general purpose way to generate data:
https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
But this seems like it would be more difficult to use than something like
`time_bucket_gapfill`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]