[GitHub] [arrow] jorgecarleitao commented on pull request #7967: ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument

GitBox Fri, 28 Aug 2020 09:24:29 -0700


jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-682832105



   The discussion for me is not so much about the return type of the function 
`sqrt`. I am fine with `sqrt_f32`. My issue is with the output of the function 
`array`, `collect_list`, `concat`, and a significant number of other functions 
whose return type is variable.
   
   I do not think that other query engines are designed to have a fixed return 
type: `collect_list` has a variable return type, `getitem` has a variable 
return type; `array` has a variable return type. Essentially, every function 
that works on non-primitive types have a variable return type, because there 
are common operations on structs that are irrespective of the specific types 
that they hold.
   
   The functions that I see we will hit soon are functions that operate on 
strings: arrow supports utf8 and Largeutf8. Are we also making all functions 
that operate on strings (e.g. `concat`) always return `Largeutf8`, or `utf8`? 
What about binary and LargeBinary? I feel that by forcing a single return type, 
DataFusion drifts farther away from Arrow as we end up supporting a smaller and 
smaller subset of the types that arrow supports on its execution plans.
   
   For me, we should embrace the fact that most engines migrated to variable 
return types at some point in time during their life-time and design a 
typing/function system that caters for that from the get go, so that we have 
all options available to us when we want to expand our function set.
   
   At the moment, I am confident that the approach in #8032 covers all relevant 
use-cases, and is also fully aligned with how we do it for binary math and 
logical operators.
   
   Of course I would respect the decision to make all built-in functions/UDF of 
a fixed return type, and design the engine assuming that invariant, but it 
would be difficult for me to continue to contribute to that part of the code 
base when I am 90% certain that we will hit another design blocker when we want 
to implement `array([c1, c2])` or something. The only viable option I see to 
add a new function such as `array`, without an item in `Expr` that generically 
supports them, is to introduce a new entry to `enum Expr`. Doing so is backward 
incompatible, which means that we will introducing a backward incompatible 
change every time we need to add a new function of variable return type.
   
   What I struggle a bit to understand is that I have raised this concern 
before and no-one has offered a solution to it.
   Specifically, if we design our UDF's API (which currently supports all our 
built-in functions) as fixed return type, how will we support `array` and other 
scalar functions of variable return type without introducing a backward 
incompatible change on every new function?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #7967: ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument

Reply via email to