[GitHub] [arrow-datafusion] alamb opened a new issue, #4904: Improved performance of RANGE preceding window functions

GitBox Sat, 14 Jan 2023 01:38:45 -0800


alamb opened a new issue, #4904:
URL: https://github.com/apache/arrow-datafusion/issues/4904


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I am not sure this is an important feature to add, but I wanted to write it 
down.
   
   Basically there is an interesting algorithm called `Segment Tree` which is 
described in :
   
   > Efficient processing of window functions in analytical SQL queries
   > by Viktor Leis, Kan Kundhikanjana, Alfons Kemper, Thomas Neumann
   
   https://dl.acm.org/doi/10.14778/2794367.2794375
   
[p1058-leis.pdf](https://github.com/apache/arrow-datafusion/files/10417023/p1058-leis.pdf)
   
   This algorithm handles window functions with `RANGE` window functions well, 
at least that is the claim. It might be a reasonable structure to implement 
instead of the "MovingMinMax" added in 
https://github.com/apache/arrow-datafusion/pull/4675 by @berkaycpp  and 
@mustafasrepo .
   
   **Describe the solution you'd like**
   
   If we hit unacceptable performance of window functions (especially with 
largely varying `RANGE`), this might an algorithm worth looking into. 
   
   As a reminder a `RANGE` window frame is determined in terms of the values of 
the partition, not the number of rows:
   
   ```sql
   
   
   ❯ create table foo as values (1), (2), (3), (4), (5), (6), (6), (6);
   0 rows in set. Query took 0.000 seconds.
   
   ❯ select column1, first_value(column1) OVER (ORDER BY column1 RANGE 3 
PRECEDING) from foo;
   +---------+--------------------------+
   | column1 | FIRST_VALUE(foo.column1) |
   +---------+--------------------------+
   | 1       | 1                        |
   | 2       | 1                        |
   | 3       | 1                        |
   | 4       | 1                        |
   | 5       | 2                        |
   | 6       | 3                        |
   | 6       | 3                        |
   | 6       | 3                        |
   +---------+--------------------------+
   
   ❯ select column1, first_value(column1) OVER (ROWS 3 PRECEDING) from foo;
   +---------+--------------------------+
   | column1 | FIRST_VALUE(foo.column1) |
   +---------+--------------------------+
   | 1       | 1                        |
   | 2       | 1                        |
   | 3       | 1                        |
   | 4       | 1                        |
   | 5       | 2                        |
   | 6       | 3                        |
   | 6       | 4                        |
   | 6       | 5                        |
   +---------+--------------------------+
   8 rows in set. Query took 0.001 seconds.
   ```
   
   
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new issue, #4904: Improved performance of RANGE preceding window functions

Reply via email to