stuartcarnie opened a new issue, #3015:
URL: https://github.com/apache/arrow-datafusion/issues/3015

   **Introduction**
   
   This proposal suggests adding a new scalar function, `date_bin`,  to 
DataFusion, for transforming timestamp values to arbitrary intervals for the 
purpose of grouping and aggregating time-series data.
   
   **Motivation**
   
   Time-series data is typically analysed in aggregate where one axis is almost 
always time. DataFusion's 
[`date_trunc`](https://github.com/apache/arrow-datafusion/blob/cd3164918b0415b072a3109f3ccf654da7518ec1/datafusion/physical-expr/src/datetime_expressions.rs#L237)
 is modelled after the PostgreSQL 
[`date_trunc`](https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC)
 function, which allows truncating a timestamp column for the purpose of 
grouping, however, the intervals are limited to an enumeration, such as second, 
minute, hour, day, week, month, quarter and year. To address this limitation, 
PostgreSQL 14 introduced the 
[`date_bin`](https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-BIN)
 function, which can bin or adjust the input timestamp to arbitrary intervals.
   
   **Describe the solution you'd like**
   
   Add a new function, `date_bin` to DataFusion.
   
   **Name**: `date_bin(stride, source, origin)`
   
   Per the PostgreSQL 14 docs
   
   > The function date_bin “bins” the input timestamp into the specified 
interval (the stride) aligned with a specified origin.
   
   **Required arguments**
   
   | Name | Type | Description |
   | :---- | :---- | :---- |
   | `stride` | INTERVAL |  time interval specifying the width of each bin. |
   | `source` | TIMESTAMP | The timestamp to transform. |
   | `origin` | TIMESTAMP | Acts as an offset applied to the transformed 
timestamp |
   
   
   ----
   
   **Example Usage**
   
   Demonstrate `date_bin`[^1]:
   
   ```sql
   SELECT 
     DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP 
'2001-01-01') AS time, 
     val 
   FROM (
     VALUES 
       ('2021-06-10 17:05:00Z', 0.5),
       ('2021-06-10 17:19:10Z', 0.3)
     ) as t (time, val);
   ```
   
   [^1]: DataFusion does not support "typed string" literals in a `VALUES` 
statement, like `VALUES ((TIMESTAMP '2021-06-10 17:05:00Z'))`
   
   producing the following output:
   
   ```
   +---------------------+-----+
   | time                | val |
   +---------------------+-----+
   | 2021-06-10 17:00:00 | 0.5 |
   | 2021-06-10 17:15:00 | 0.3 |
   +---------------------+-----+
   2 rows in set. Query took 0.002 seconds.
   ```
   
   ----
   
   **Example Usage: time offset for origin**
   
   ```sql
   SELECT 
     DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP 
'2001-01-01 00:02:30') AS time, 
     val 
   FROM (
     VALUES 
       ('2021-06-10 17:05:00Z', 0.5),
       ('2021-06-10 17:19:10Z', 0.3)
     ) as t (time, val);
   ```
   
   producing the following output:
   
   ```
   +---------------------+-----+
   | time                | val |
   +---------------------+-----+
   | 2021-06-10 17:02:30 | 0.5 |
   | 2021-06-10 17:17:30 | 0.3 |
   +---------------------+-----+
   2 rows in set. Query took 0.002 seconds.
   ```
   
   
   **Describe alternatives you've considered**
   
   `date_trunc`, as mentioned, provides limited support for binning timestamps, 
but there is no alternative but to provide a native function.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to