alamb opened a new pull request, #6617: URL: https://github.com/apache/arrow-datafusion/pull/6617
# Which issue does this PR close?

https://github.com/apache/arrow-datafusion/issues/5781

# Rationale for this change

We would like to allow users to take full advantage of the power of DataFusion's window functions (largely contributed by @ozankabak and @mustafasrepo 👏).

This PR contains a potential implementation of User Defined Window Functions (the "Use existing APIs" approach described in https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1583105449).

I don't intend to merge this specific PR. Instead, if the community likes this basic approach, I will break this PR up into pieces and merge them incrementally.

# What changes are included in this PR?

The new example in this PR shows how this works. Run

```shell
cargo run --example simple_udwf
```

This produces the following output (where `my_average`'s implementation is defined in `simple_udwf.rs` as a user defined window function):

```
+-------+-------+--------------------------+------------------------+---------------------+
| car   | speed | LAG(cars.speed,Int64(1)) | my_average(cars.speed) | time                |
+-------+-------+--------------------------+------------------------+---------------------+
| red   | 20.0  |                          | 20.0                   | 1996-04-12T12:05:03 |
| red   | 20.3  | 20.0                     | 20.15                  | 1996-04-12T12:05:04 |
| red   | 21.4  | 20.3                     | 20.85                  | 1996-04-12T12:05:05 |
| red   | 21.5  | 21.4                     | 21.45                  | 1996-04-12T12:05:06 |
| red   | 19.0  | 21.5                     | 20.25                  | 1996-04-12T12:05:07 |
| red   | 18.0  | 19.0                     | 18.5                   | 1996-04-12T12:05:08 |
| red   | 17.0  | 18.0                     | 17.5                   | 1996-04-12T12:05:09 |
| red   | 7.0   | 17.0                     | 12.0                   | 1996-04-12T12:05:10 |
| red   | 7.1   | 7.0                      | 7.05                   | 1996-04-12T12:05:11 |
| red   | 7.2   | 7.1                      | 7.15                   | 1996-04-12T12:05:12 |
| red   | 3.0   | 7.2                      | 5.1                    | 1996-04-12T12:05:13 |
| red   | 1.0   | 3.0                      | 2.0                    | 1996-04-12T12:05:14 |
| red   | 0.0   | 1.0                      | 0.5                    | 1996-04-12T12:05:15 |
| green | 10.0  |                          | 10.0                   | 1996-04-12T12:05:03 |
| green | 10.3  | 10.0                     | 10.15                  | 1996-04-12T12:05:04 |
| green | 10.4  | 10.3                     | 10.350000000000001     | 1996-04-12T12:05:05 |
| green | 10.5  | 10.4                     | 10.45                  | 1996-04-12T12:05:06 |
| green | 11.0  | 10.5                     | 10.75                  | 1996-04-12T12:05:07 |
| green | 12.0  | 11.0                     | 11.5                   | 1996-04-12T12:05:08 |
| green | 14.0  | 12.0                     | 13.0                   | 1996-04-12T12:05:09 |
| green | 15.0  | 14.0                     | 14.5                   | 1996-04-12T12:05:10 |
| green | 15.1  | 15.0                     | 15.05                  | 1996-04-12T12:05:11 |
| green | 15.2  | 15.1                     | 15.149999999999999     | 1996-04-12T12:05:12 |
| green | 8.0   | 15.2                     | 11.6                   | 1996-04-12T12:05:13 |
| green | 2.0   | 8.0                      | 5.0                    | 1996-04-12T12:05:14 |
+-------+-------+--------------------------+------------------------+---------------------+
```

Here are the major changes in this PR:

1. Move the `PartitionEvaluator` definition into `datafusion_expr` (much like the `Accumulator` trait for AggregateUDFs)
2. Move `WindowAggState`, `WindowFrameContext` and some related structures to `datafusion_expr` (so the UDWF does not depend on `datafusion-physical-expr`)
3. `Trait`-ify the built-in state so `WindowUDF` does not depend on `datafusion-physical-expr`

# Open questions

I think it may be possible to simplify `PartitionEvaluator` by removing the state management, which would reduce the amount of code that needs to be moved to `datafusion_expr`. I will try to do this as a separate PR.

# Outstanding cleanups

I found a place where the optimizer special-cases a particular window function, which I think I can remove (and I will try to do so as a separate PR):

https://github.com/apache/arrow-datafusion/blob/1af846bd8de387ce7a6e61a2008917a7610b9a7b/datafusion/core/src/physical_plan/windows/mod.rs#L254-L257
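
For readers who want to check the `my_average` column in the example output by hand: the values are consistent with averaging the current row and the row before it within each `car` partition. That frame is inferred from the numbers shown, not taken from `simple_udwf.rs`. The sketch below is plain Rust and deliberately uses no DataFusion APIs; the function name `moving_average` is hypothetical and only illustrates the per-partition computation a window evaluator would perform.

```rust
/// Hypothetical stand-in for a per-partition window evaluation (NOT the
/// DataFusion `PartitionEvaluator` API): for row `i`, average the values in
/// the frame `[i - 1, i]`, clamped at the start of the partition.
/// The one-preceding-row frame is an assumption inferred from the output.
fn moving_average(partition: &[f64]) -> Vec<f64> {
    (0..partition.len())
        .map(|i| {
            // The first row has no preceding row, so its frame is itself.
            let start = i.saturating_sub(1);
            let frame = &partition[start..=i];
            frame.iter().sum::<f64>() / frame.len() as f64
        })
        .collect()
}

fn main() {
    // First few `speed` values of the "red" partition from the output table.
    let red = [20.0, 20.3, 21.4, 21.5];
    for (speed, avg) in red.iter().zip(moving_average(&red)) {
        println!("speed = {speed}, my_average = {avg}");
    }
}
```

Each partition is evaluated independently, so the first row of `green` starts over with an average equal to its own speed, as in the table.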
