[I] [EPIC] Unify Function Interface (remove `BuiltInScalarFunction`) [arrow-datafusion]

via GitHub Fri, 03 Nov 2023 11:17:20 -0700


alamb opened a new issue, #8045:
URL: https://github.com/apache/arrow-datafusion/issues/8045


   ### Is your feature request related to a problem or challenge?
   
   This is based on the wonderful writeup from @2010YOUY01  in 
https://github.com/apache/arrow-datafusion/issues/7977
   
   As previously discussed in 
https://github.com/apache/arrow-datafusion/issues/7110 and 
https://github.com/apache/arrow-datafusion/pull/7752, there are a few 
challenges with how ScalarFunctions are handled, notably that there are two 
distinct implementations: `BuiltinScalarFunction` and `ScalarUDF`.
   
   #### Problems with `BuiltinScalarFunction`
   
   1. As more functions are added, the total footprint of DataFusion grows, 
even for those who don't need the specific functions. This also acts to limit 
the number of functions built into DataFusion.
   2. The desired semantics may differ between users (e.g. many of 
the built-in functions in DataFusion mirror Postgres behavior, but some users 
wish to mimic Spark behavior).
   3. User defined functions are treated differently from built-in functions in 
some ways (e.g. they can't have aliases).
   4. Built-in functions are implemented with `enum BuiltinScalarFunction`, and 
function implementations like `return_type()` are large methods that match 
every enum variant. Adding a new function requires modifications in multiple 
places, which makes functions hard to add.
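
To illustrate item 4, here is a minimal sketch of the enum-plus-match pattern. 
The `DataType` and `BuiltinFn` types below are simplified, hypothetical 
stand-ins and not DataFusion's actual definitions; the point is only that each 
new variant has to be added to every `match`.

```rust
// Simplified stand-in for a data type enum (assumption, not DataFusion's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

// Hypothetical, heavily reduced enum of built-in functions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuiltinFn {
    Abs,
    Sqrt,
    Upper,
}

impl BuiltinFn {
    // Touch point 1: name lookup has to list every variant.
    fn from_name(name: &str) -> Option<Self> {
        match name {
            "abs" => Some(BuiltinFn::Abs),
            "sqrt" => Some(BuiltinFn::Sqrt),
            "upper" => Some(BuiltinFn::Upper),
            _ => None,
        }
    }

    // Touch point 2: return type resolution has to list every variant again.
    fn return_type(&self, arg: DataType) -> DataType {
        match self {
            BuiltinFn::Abs => arg,
            BuiltinFn::Sqrt => DataType::Float64,
            BuiltinFn::Upper => DataType::Utf8,
        }
    }
}
```

Adding a fourth function in this sketch means editing the enum and both 
`match` blocks, and a real implementation would likely have more such methods.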
   
   #### Problems with `ScalarUDF`
   * The current implementation of `ScalarUDF` as a struct does not cover all 
the functionality of existing built-in functions.
   * Defining a new `ScalarUDF` requires constructing a struct imperatively 
and providing `Arc` function pointers (see examples/simple_udf.rs), which is 
unfamiliar to Rust users, who more commonly see `dyn Trait` objects.
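
The contrast between the two styles can be sketched roughly as follows. All 
names here (`FnPointerUdf`, `ScalarFn`, `AddOne`, and the `f64`-based 
signatures) are hypothetical simplifications chosen for illustration; they are 
not the real `ScalarUDF` fields or a proposed trait definition.

```rust
use std::sync::Arc;

// Simplified stand-in for a data type enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

// Style 1: a struct assembled imperatively from `Arc`'d closures.
type ReturnTypeFn = Arc<dyn Fn(&[DataType]) -> DataType + Send + Sync>;
type ScalarImpl = Arc<dyn Fn(&[f64]) -> f64 + Send + Sync>;

struct FnPointerUdf {
    name: String,
    return_type: ReturnTypeFn,
    fun: ScalarImpl,
}

fn make_add_one_struct() -> FnPointerUdf {
    FnPointerUdf {
        name: "add_one".to_string(),
        return_type: Arc::new(|_args: &[DataType]| DataType::Float64),
        fun: Arc::new(|args: &[f64]| args[0] + 1.0),
    }
}

// Style 2: a trait that each function implements, used via `dyn ScalarFn`.
trait ScalarFn: Send + Sync {
    fn name(&self) -> &str;
    fn return_type(&self, arg_types: &[DataType]) -> DataType;
    fn invoke(&self, args: &[f64]) -> f64;
}

struct AddOne;

impl ScalarFn for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }
    fn return_type(&self, _arg_types: &[DataType]) -> DataType {
        DataType::Float64
    }
    fn invoke(&self, args: &[f64]) -> f64 {
        args[0] + 1.0
    }
}
```

Both values describe the same function; the trait version keeps the name, 
return type, and implementation together in one `impl` block.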
   
   
   
   
   ### Describe the solution you'd like
   
   I propose moving DataFusion to define functions **only** as `ScalarUDF`s. 
This will ensure:
   
   1. ScalarUDFs have access to all the same functionality as "built-in" 
functions
   2. No function specific code will escape planning
   3. DataFusion's core can remain focused, and external libraries of packages 
can be used to customize its use. 
   
   We will keep the existing ScalarUDF interface as much as possible, while 
also potentially providing an easier way to define them (ideally via a trait 
object). 
   
   ### Describe alternatives you've considered
   
   https://github.com/apache/arrow-datafusion/issues/7977 describes introducing 
a new trait and unifying both ScalarUDF and BuiltInScalarFunction with this 
trait. 
   
   This approach also allows gradually migrating existing built-in functions to 
the new trait, while the old UDF interface `create_udf()` can remain unchanged.
   
   However, I think it is a bigger change for users, and 
   
   ### Additional context
   
   Proposed implementation steps:
   
   - [ ] Prototype ScalarUDF interface changes (make the fields non `pub`): 
https://github.com/apache/arrow-datafusion/pull/8039
   - [ ] Prototype what registering external packages would look like (by making 
a prototype for some built-in functions)
   - [ ] Productionize ScalarUDF API changes
   - [ ] Break down the lists of packages and start extracting them into their 
own packages. 
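
As a rough illustration of what registering an external function package could 
look like, here is a minimal sketch. `ScalarFn`, `FunctionRegistry`, and 
`register_math_package` are hypothetical names invented for this example and 
are not part of DataFusion's API; the sketch also assumes aliases are 
supported uniformly for every function.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical function trait (assumption, not DataFusion's interface).
trait ScalarFn: Send + Sync {
    fn name(&self) -> &str;
    // Alternative names for the function; none by default.
    fn aliases(&self) -> Vec<&str> {
        Vec::new()
    }
    fn invoke(&self, args: &[f64]) -> f64;
}

// Hypothetical registry: every function goes through the same path.
#[derive(Default)]
struct FunctionRegistry {
    functions: HashMap<String, Arc<dyn ScalarFn>>,
}

impl FunctionRegistry {
    // Register a function under its primary name and all of its aliases.
    fn register(&mut self, f: Arc<dyn ScalarFn>) {
        for alias in f.aliases() {
            self.functions.insert(alias.to_string(), Arc::clone(&f));
        }
        self.functions.insert(f.name().to_string(), f);
    }

    fn get(&self, name: &str) -> Option<Arc<dyn ScalarFn>> {
        self.functions.get(name).cloned()
    }
}

// A function that would live in a separate, optional package.
struct Power;

impl ScalarFn for Power {
    fn name(&self) -> &str {
        "power"
    }
    fn aliases(&self) -> Vec<&str> {
        vec!["pow"]
    }
    fn invoke(&self, args: &[f64]) -> f64 {
        args[0].powf(args[1])
    }
}

// Entry point a package might expose so users can opt in to its functions.
fn register_math_package(registry: &mut FunctionRegistry) {
    registry.register(Arc::new(Power));
}
```

A user who wants these functions calls `register_math_package` on their 
registry; a user who does not simply never links or calls the package.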

