mustafasrepo commented on PR #9221: URL: https://github.com/apache/arrow-datafusion/pull/9221#issuecomment-1955987551
> The only thing concerns me when I run tests I didn't see `.evaluate_all` has been called `evaluate_all` is called from `WindowAggExec`. `WindowAggExec` works when one of the window frame boundaries include `UNBOUNDED FOLLOWING`. Hence for the query below ```sql SELECT LAG(c9, 2) OVER(ORDER BY c9), SUM(c9) OVER(order by c9 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM aggregate_test_100; ``` `WindowAggExec` will be used. Unfortunately, from a SQL query I couldn't come up with a simpler reproducer. > One idea I had is to reverse input array before calling evaluate for only LEAD function and it potentially should work, but there reversing might be expensive Another approach might be for the table, we can construct a vector with same length to track non null count. Such as for the table below a | -- | 1 2 null 3 null 4 we would construct vector non_null_count | -- | 1 2 2 3 3 4 where count is incremented each time a new non-null value is seen. `LAG`, `LEAD` can work on this `vector` to determine the place of the value. For instance, for the `row index 5` in the original table (where `a=4`) `LAG(2)` should produce `2`. We can determine this by finding `non_null_count` of the row (which is 4 in our case). When we determine `LAG` on this table we can determine that `non_null_count` of the result should be `2`. Then we can find the index of first `2` in the `non_null_count` vector. (Other `2`s point to `null` values.). Index would be `1` which has value `a=2` in the table. `LEAD` can work similarly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
