[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

GitBox Tue, 25 Aug 2020 13:11:11 -0700


jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-680244598



   @alamb thank you very much for your comments, I will now work on addressing 
them now. I still learning the Arc/Box/Ref, so thank you a lot for also 
teaching me.
   
   @andygrove , I agree with you that built-in functions should not require 
access to the registry. Unfortunately, doing so required some re-work, which is 
the reason I retracted #7967 back to draft to focus on this one first.
   
   I pushed a new commit to this PR to address this point. Specifically, that 
commit adds:
   
   * a new enum with all built-in functions
   * functionally gluing the logical plan with the physical plan so that the 
function's return types are invariant.
   * made type coercion on built-in functions to be on the physical plane, to 
preserve schema invariance during planning.
   
   I am pretty happy with this PR, as IMO has the flexibility we need to expand 
DataFusion's pool of built-in functions to multiple input and return types. The 
main features of this PR:
   
   * users no longer have to pass the return type of the UDF when calling them 
(the proposal)
   * planning built-in functions continue to not need access to the registry 
(@andygrove 's point)
   * built-in functions now support multiple input types (e.g. `sqrt(f32)`, 
`sqrt(f64)`)
   * built-in functions now support multiple return types (e.g. `sqrt(f32) -> 
f32`, `sqrt(f64) -> f64`)
   * coercion rules are no longer applied in the sql planning or physical 
planning to built-in functions, to avoid breaking schema invariance during 
planning
   
   I have not completed the valid return types of built-in math functions as 
this PR was already too long.
   
   Overall, I think that this has not been a pleasant experience for you 
@andygrove and @alamb, as I constantly open and close PRs around 
functions/UDFs, and for that I am really sorry. I've been hitting some design 
challenge after another, which requires me to go back and forth.
   
   I am still in pursuit of my original quests:
   
   * built-in aggregate functions whose logical types are known from the 
physical expressions
   * type coercion on aggregate functions
   * built-in aggregate functions whose return types (e.g. `min(f32) -> f32`, 
`min(f64) -> f64`) are directly derived from the physical plan (there is an old 
fixme/todo in the code around that)
   * aggregate udfs
   * udfs with multiple incoming and return types, to bring them to the same 
level of functionality of built-ins
   * planning a udf without registering it (a-la spark) in the DF's API.
    
   I have code for some of this, I... just... need... to... finish... the... 
scalar... stuff... first...  😃 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Reply via email to