Re: [I] suggest add synax `select from generate_series()` [datafusion]

via GitHub Wed, 25 Sep 2024 02:21:04 -0700


Abdullahsab3 commented on issue #10069:
URL: https://github.com/apache/datafusion/issues/10069#issuecomment-2373531351


   I think switching `generate_series` to a UDTF will break the API. I think in 
most cases it's an acceptable API change. However there is (at least for me) 1 
particular use case that of the array generated from `generate_series` that 
will probably need closer attention.
   
   <details>
     <summary>Example</summary>
   I need to spread the data equally over equal intervals between 2 data 
points. I use `generate_series` to generate an array per row. The way that I do 
this now is as follows: 
   
     ```sql
   create table manual(time timestamp, index_value DOUBLE);
   insert into
       manual (time, index_value)
   values
       ('2022-01-31 23:00:00.0000000', 935035.000),
       ('2022-02-28 23:00:00.0000000', 954469.000),
       ('2022-03-31 22:00:00.0000000', 905204.000),
       ('2022-04-30 22:00:00.0000000', 971415.000);
    ```
   
   ```sql
   with
       index_values AS (
       select
           time AS to_time,
           lag(time, 1) over (order by time) AS from_time,
           index_value AS to_index,
           lag(index_value, 1) over (order by time) AS from_index
       from manual
   ),
       series AS (
           select
               unnest(generate_series(from_time, to_time, interval '1 month')) 
AS time,
               (case
                   when to_index < from_index THEN (999999 - from_index) + 
to_index
                   else to_index - from_index end)
                   / coalesce(nullif(array_length(generate_series(from_time, 
to_time, interval '1 month')), 0), 1) AS delta_value
           from index_values
       ) SELECT * FROM series;
   ```
   
   Specifically `array_length(generate_series(from_time, to_time, interval '1 
month'))` will need to be done differently if `generate_series` becomes a UDTF. 
It will probably need to become `COUNT(time) OVER (partition by to_time, 
from_time)` :
   
   ```sql
   with
       index_values AS (
       select
           time AS to_time,
           lag(time, 1) over (order by time) AS from_time,
           index_value AS to_index,
           lag(index_value, 1) over (order by time) AS from_index
       from manual
   ),
       series AS (
           select
               time,
               (case
                   when to_index < from_index THEN (999999 - from_index) + 
to_index
                   else to_index - from_index end)
                   / coalesce(nullif(count(time) over (partition by to_time, 
from_time), 0), 1) AS delta_value
           from index_values, generate_series(from_time, to_time, interval '900 
seconds') AS time
       ) SELECT * FROM series;
   ```
   
   (If you have other/better suggestions for how data can be 
spreaded/interpolated in raw DF, all ears :) )
   </details>
   
   TL;DR: If you're using `generate_series` to generate a series on a row-based 
basis, you may need to consider `OVER PARTITION BY (your_row_composite_key)` if 
this API change takes place
   
   Just mentioning this here in case anyone is using `generate_series` in a 
similar way as I do


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] suggest add synax `select from generate_series()` [datafusion]

Reply via email to