jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-674555985


   > I reviewed the logic carefully in this PR and I think overall it is quite good. Nice work @jorgecarleitao. The only things I personally think are needed before it would be ready to merge are the following tests:
   > 
   >     1. End-to-end test (in sql.rs) using a UDF with multiple possible type 
signatures
   > 
   >     2. Unit tests for the coercion logic
   > 
   > 
   > All the rest of the stuff in this PR is suggestions.
   > 
   > Along with the test, it might be cool to write up some docs / comments that show how to write a UDF that actually accepts multiple types -- the only [test I see now](https://github.com/apache/arrow/pull/7967/files#diff-8273e76b6910baa123f3a25a967af3b5L1237) has a single set of argument types.
   > 
   > In this design, the UDFs are effectively "polymorphic" in the sense that 
they can accept multiple different argument types and will have to dispatch at 
runtime.
   > 
   > Another potential design for UDFs is to provide each UDF with an alias 
that could be duplicated and a single argument type (e.g. `sqrt --> 
sqrt_32(f32)` and `sqrt --> sqrt_64(f64)`). Then an optimizer / coercion pass 
would have the logic to resolve the `sqrt` alias to `sqrt_32` or `sqrt_64` 
depending on the input argument types.
   > 
   > This approach might be marginally faster as the input types would be 
resolved once at planning time rather than during runtime. However, given that 
datafusion has a vectorized executor (e.g. the types would be handled once per 
RecordBatch) the overhead of runtime dispatch will likely not be noticeable.
   
   Thank you again @alamb for taking the time to review this. I agree with all 
you said.
   
   My other two PRs effectively add support for polymorphic functions (including the return type). The reason is that we already do this for our own UDFs, via `data_type()/get_type()`, at both the physical and logical level. It is intrinsic to our execution model, which requires downcasting and builders inside the function. Since we already require the developer to go through that pain, we may as well offer them the full flexibility that comes with it. I agree that there is a small overhead, but IMO it is small compared to execution, and we can always use some form of cached_attribute if we start seeing performance issues in the planning phase.
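
   To make the runtime-dispatch idea concrete, here is a minimal sketch of what the body of such a polymorphic function can look like. The function name and the `fn(&[ArrayRef]) -> Result<ArrayRef, String>` signature are illustrative only (they are not the exact UDF signature in this PR); the point is the downcasting and array building driven by the input's `DataType`:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Float32Array, Float64Array};
use arrow::datatypes::DataType;

/// Hypothetical polymorphic `sqrt`: accepts Float32 or Float64 input and
/// returns an array of the same type. Null handling is omitted for brevity.
fn poly_sqrt(args: &[ArrayRef]) -> Result<ArrayRef, String> {
    let input = &args[0];
    match input.data_type() {
        DataType::Float32 => {
            let array = input
                .as_any()
                .downcast_ref::<Float32Array>()
                .expect("downcast to Float32Array");
            // dispatch resolved: compute on f32 values and build an f32 result
            let values: Vec<f32> = (0..array.len()).map(|i| array.value(i).sqrt()).collect();
            Ok(Arc::new(Float32Array::from(values)))
        }
        DataType::Float64 => {
            let array = input
                .as_any()
                .downcast_ref::<Float64Array>()
                .expect("downcast to Float64Array");
            let values: Vec<f64> = (0..array.len()).map(|i| array.value(i).sqrt()).collect();
            Ok(Arc::new(Float64Array::from(values)))
        }
        other => Err(format!("sqrt does not support type {:?}", other)),
    }
}
```

   Because execution is vectorized, this `match` runs once per RecordBatch column rather than once per value, which is why the dispatch overhead stays small.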
   
   As an example of the impressive flexibility we get, I was able to run Python lambdas inside a DataFusion UDF in a [pet project of mine](https://github.com/jorgecarleitao/datafusion-python), and convert the result back to numpy arrays, which IMO is mind-blowing.
   
   Changes from last time:
   
   * incorporated almost all your suggestions
   * added end-to-end tests
   * added tests for both internal type-coercion functions
   
   I think the logic of `get_supertype` is not entirely correct at the moment (e.g. utf8 is currently treated as convertible to all types), but we have an issue tracking that.
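
   For reference, the kind of rule `get_supertype` is meant to encode looks roughly like the toy sketch below. This is illustrative only (the name, the rules, and the set of types covered are made up for the example), not the actual implementation in this PR:

```rust
use arrow::datatypes::DataType;

/// Toy example of a supertype rule: given two argument types, return a common
/// type both can be coerced to, or `None` when there is no sensible supertype.
fn toy_supertype(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    use DataType::*;
    match (lhs, rhs) {
        (a, b) if a == b => Some(a.clone()),
        (Float64, Float32) | (Float32, Float64) => Some(Float64),
        (Float64, Int32) | (Int32, Float64) => Some(Float64),
        (Int64, Int32) | (Int32, Int64) => Some(Int64),
        // Blindly treating Utf8 as a supertype of everything would be too
        // permissive, which is the issue referenced above.
        _ => None,
    }
}
```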
   
   Currently I struggle to know exactly where to place documentation for this. IMO this is not something to put in the API documentation, but rather in a user guide. I have been using UDFs in a [pet project of mine](https://github.com/jorgecarleitao/datafusion-python), but I think we need to take some time to create placeholders for this in our `docs/`. Pinging @andygrove for ideas also.

