jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-680244598
@alamb thank you very much for your comments, I will now work on addressing
them now. I still learning the Arc/Box/Ref, so thank you a lot for also
teaching me.
@andygrove , I agree with you that built-in functions should not require
access to the registry. Unfortunately, doing so required some re-work, which is
the reason I retracted #7967 back to draft to focus on this one first.
I pushed a new commit to this PR to address this point. Specifically, that
commit adds:
* a new enum with all built-in functions
* functionally gluing the logical plan with the physical plan so that the
function's return types are invariant.
* made type coercion on built-in functions to be on the physical plane, to
preserve schema invariance during planning.
I am pretty happy with this PR, as IMO has the flexibility we need to expand
DataFusion's pool of built-in functions to multiple input and return types. The
main features of this PR:
* users no longer have to pass the return type of the UDF when calling them
(the proposal)
* planning built-in functions continue to not need access to the registry
(@andygrove 's point)
* built-in functions now support multiple input types (e.g. `sqrt(f32)`,
`sqrt(f64)`)
* built-in functions now support multiple return types (e.g. `sqrt(f32) ->
f32`, `sqrt(f64) -> f64`)
* coercion rules are no longer applied in the sql planning or physical
planning to built-in functions, to avoid breaking schema invariance during
planning
I have not completed the valid return types of built-in math functions as
this PR was already too long.
Overall, I think that this has not been a pleasant experience for you
@andygrove and @alamb, as I constantly open and close PRs around
functions/UDFs, and for that I am really sorry. I've been hitting some design
challenge after another, which requires me to go back and forth.
I am still in pursuit of my original quests:
* built-in aggregate functions whose logical types are known from the
physical expressions
* type coercion on aggregate functions
* built-in aggregate functions whose return types (e.g. `min(f32) -> f32`,
`min(f64) -> f64`) are directly derived from the physical plan (there is an old
fixme/todo in the code around that)
* aggregate udfs
* udfs with multiple incoming and return types, to bring them to the same
level of functionality of built-ins
* planning a udf without registering it (a-la spark) in the DF's API.
I have code for some of this, I... just... need... to... finish... the...
scalar... stuff... first... 😃
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]