alamb opened a new issue, #8045: URL: https://github.com/apache/arrow-datafusion/issues/8045
### Is your feature request related to a problem or challenge?

This is based on the wonderful writeup from @2010YOUY01 in https://github.com/apache/arrow-datafusion/issues/7977

As previously discussed in https://github.com/apache/arrow-datafusion/issues/7110 and https://github.com/apache/arrow-datafusion/pull/7752, there are a few challenges with how scalar functions are handled, notably that there are two distinct implementations: `BuiltinScalarFunction` and `ScalarUDF`.

#### Problems with `BuiltinScalarFunction`

1. As more functions are added, the total footprint of DataFusion grows, even for those who don't need the specific functions. This also acts to limit the number of functions built into DataFusion.
2. The desired semantics may be different for different users (e.g. many of the built-in functions in DataFusion mirror postgres behavior, but some users wish to mimic spark behavior).
3. User defined functions are treated differently from built-in functions in some ways (e.g. they can't have aliases).
4. Built-in functions are implemented with `enum BuiltinScalarFunction`, and function implementations like `return_type()` are large methods that match every enum variant. Adding a new function requires modifications in multiple places, so it is not easy to add functions.

#### Problems with `ScalarUDF`

* The current implementation of `ScalarUDF` as a struct does not cover all the functionality of the existing built-in functions.
* Defining a new `ScalarUDF` requires constructing a struct in an imperative way, providing `Arc` function pointers (see examples/simple_udf.rs). This is not familiar to Rust users, for whom it is more common to see `dyn Trait` objects.

### Describe the solution you'd like

I propose moving DataFusion to **only** define functions as `ScalarUDF`s. This will ensure:

1. `ScalarUDF`s have access to all the same functionality as "built-in" functions
2. No function-specific code will escape planning
3. DataFusion's core can remain focused, and external libraries of packages can be used to customize its use.

We will keep the existing `ScalarUDF` interface as much as possible, while also potentially providing an easier way to define them (ideally via a trait object).

### Describe alternatives you've considered

https://github.com/apache/arrow-datafusion/issues/7977 describes introducing a new trait and unifying both `ScalarUDF` and `BuiltinScalarFunction` with this trait. That approach also allows gradually migrating existing built-in functions to the new trait, and the old UDF interface `create_udf()` can remain unchanged.

However, I think it is a bigger change for users, and

### Additional context

Proposed implementation steps:

- [ ] Prototype `ScalarUDF` interface changes (make the fields non `pub`): https://github.com/apache/arrow-datafusion/pull/8039
- [ ] Prototype what registering external packages would look like (by making a prototype for some `BuiltinScalarFunction`s)
- [ ] Productionize `ScalarUDF` API changes
- [ ] Break down the lists of packages and start extracting them into their own packages.
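
To make the contrast between the enum-based built-ins and a `dyn Trait` style definition concrete, here is a minimal, self-contained sketch. It does not use DataFusion's actual API: `SimpleType`, `SimpleValue`, `BuiltinFn`, `ScalarFn`, `Upper`, and `FunctionRegistry` are all hypothetical names invented for illustration, and the real `ScalarUDF` interface may look quite different.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a data type (not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleType {
    Int64,
    Utf8,
}

/// Hypothetical stand-in for a scalar value.
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleValue {
    Int64(i64),
    Utf8(String),
}

// Enum style: each method must match on every variant, so adding a
// function means touching every such method.
pub enum BuiltinFn {
    Abs,
    Upper,
}

impl BuiltinFn {
    pub fn return_type(&self) -> SimpleType {
        match self {
            BuiltinFn::Abs => SimpleType::Int64,
            BuiltinFn::Upper => SimpleType::Utf8,
        }
    }
}

// Trait style (hypothetical): everything about one function lives in one
// `impl`, and aliases are available to any implementor.
pub trait ScalarFn: Send + Sync {
    fn name(&self) -> &str;
    fn aliases(&self) -> &[&str] {
        &[]
    }
    fn return_type(&self, arg_types: &[SimpleType]) -> Result<SimpleType, String>;
    fn invoke(&self, args: &[SimpleValue]) -> Result<SimpleValue, String>;
}

pub struct Upper;

impl ScalarFn for Upper {
    fn name(&self) -> &str {
        "upper"
    }
    fn aliases(&self) -> &[&str] {
        &["ucase"]
    }
    fn return_type(&self, arg_types: &[SimpleType]) -> Result<SimpleType, String> {
        match arg_types {
            [SimpleType::Utf8] => Ok(SimpleType::Utf8),
            other => Err(format!("upper expects one Utf8 argument, got {:?}", other)),
        }
    }
    fn invoke(&self, args: &[SimpleValue]) -> Result<SimpleValue, String> {
        match args {
            [SimpleValue::Utf8(s)] => Ok(SimpleValue::Utf8(s.to_uppercase())),
            other => Err(format!("upper expects one Utf8 argument, got {:?}", other)),
        }
    }
}

/// Hypothetical registry: external packages would register trait objects here.
#[derive(Default)]
pub struct FunctionRegistry {
    functions: HashMap<String, Arc<dyn ScalarFn>>,
}

impl FunctionRegistry {
    /// Registers the function under its name and under each alias.
    pub fn register(&mut self, f: Arc<dyn ScalarFn>) {
        for alias in f.aliases() {
            self.functions.insert(alias.to_string(), Arc::clone(&f));
        }
        self.functions.insert(f.name().to_string(), f);
    }

    pub fn get(&self, name: &str) -> Option<Arc<dyn ScalarFn>> {
        self.functions.get(name).cloned()
    }
}
```

In this sketch, a caller would build a `FunctionRegistry::default()`, call `register(Arc::new(Upper))`, and look the function up by either `"upper"` or `"ucase"`. The point is only that a new function is a single `impl` block rather than a new enum variant plus edits to every `match`.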
