jorgecarleitao commented on pull request #7967: URL: https://github.com/apache/arrow/pull/7967#issuecomment-682832105
The discussion for me is not so much about the return type of the function `sqrt`. I am fine with `sqrt_f32`. My issue is with the output of the function `array`, `collect_list`, `concat`, and a significant number of other functions whose return type is variable. I do not think that other query engines are designed to have a fixed return type: `collect_list` has a variable return type, `getitem` has a variable return type; `array` has a variable return type. Essentially, every function that works on non-primitive types have a variable return type, because there are common operations on structs that are irrespective of the specific types that they hold. The functions that I see we will hit soon are functions that operate on strings: arrow supports utf8 and Largeutf8. Are we also making all functions that operate on strings (e.g. `concat`) always return `Largeutf8`, or `utf8`? What about binary and LargeBinary? I feel that by forcing a single return type, DataFusion drifts farther away from Arrow as we end up supporting a smaller and smaller subset of the types that arrow supports on its execution plans. For me, we should embrace the fact that most engines migrated to variable return types at some point in time during their life-time and design a typing/function system that caters for that from the get go, so that we have all options available to us when we want to expand our function set. At the moment, I am confident that the approach in #8032 covers all relevant use-cases, and is also fully aligned with how we do it for binary math and logical operators. Of course I would respect the decision to make all built-in functions/UDF of a fixed return type, and design the engine assuming that invariant, but it would be difficult for me to continue to contribute to that part of the code base when I am 90% certain that we will hit another design blocker when we want to implement `array([c1, c2])` or something. The only viable option I see to add a new function such as `array`, without an item in `Expr` that generically supports them, is to introduce a new entry to `enum Expr`. Doing so is backward incompatible, which means that we will introducing a backward incompatible change every time we need to add a new function of variable return type. What I struggle a bit to understand is that I have raised this concern before and no-one has offered a solution to it. Specifically, if we design our UDF's API (which currently supports all our built-in functions) as fixed return type, how will we support `array` and other scalar functions of variable return type without introducing a backward incompatible change on every new function? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org