jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683467212
Thanks a lot @alamb for sharing that. I now see what you mean.
I was not able to fully convince myself, likely because I did not fully
understand everything. So maybe you can help me; below is my reasoning so far:
#### A.
This introduces an interesting feature: a UDF is no longer uniquely
identifiable by its name. If a user registers two functions under the same name
(like `sqrt_32` and `sqrt_64` above), both become part of the registry,
and it is up to the coercer to decide which one to use based on the argument types.
On the one hand, this allows registering variants of a function,
which makes it possible to extend an existing UDF; OTOH, it may not be very intuitive that
a single function "sqrt" can be registered multiple times, with its
signature and the coercer's logic deciding which variant is used.
In particular, the registry can't warn or error if the user tries to
re-register a function with the same name: it must assume that every new
registration is a potential new variant that the coercer may pick up.
Wouldn't this also mean that users would need to ensure that they do not
register 2 traits with the same name and a compatible `desired_return_type`? I
can imagine a situation where the coercer returns non-deterministic plans
because two variants are valid and one is picked by chance, or where the
order in which functions with compatible signatures are registered dictates which
one is used.
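To make this concrete, here is a rough sketch of the situation I have in mind. The trait, registry, and method names below are stand-ins I made up for illustration, not this PR's API:
```
use std::collections::HashMap;

use arrow::datatypes::DataType;

// Stand-in for the stateless trait discussed in this PR.
trait ScalarUdfVariant {
    fn name(&self) -> &str;
    // true if this variant accepts the given argument types
    fn valid_argument_types(&self, args: &[DataType]) -> bool;
}

#[derive(Default)]
struct Registry {
    udfs: HashMap<String, Vec<Box<dyn ScalarUdfVariant>>>,
}

impl Registry {
    // Registration cannot reject a duplicate name: it must assume the new
    // entry is another variant that the coercer may pick up later.
    fn register(&mut self, udf: Box<dyn ScalarUdfVariant>) {
        let name = udf.name().to_string();
        self.udfs.entry(name).or_default().push(udf);
    }

    // If two variants accept the same argument types, the first one
    // registered wins -- the order dependence described above.
    fn resolve(&self, name: &str, args: &[DataType]) -> Option<&dyn ScalarUdfVariant> {
        self.udfs
            .get(name)?
            .iter()
            .find(|v| v.valid_argument_types(args))
            .map(|v| v.as_ref())
    }
}
```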
#### B.
`concat::desired_argument_types` returns `[None, None]`.
I thought that we wanted `concat` to:
1. accept only `LargeUTF8` or `UTF8`
2. accept an arbitrary number of arguments (or the user needs to write/see
in the plan `concat(concat(concat(c1, c2), "-"), c4)` instead of `concat(c1,
c2, "-", c4)`)
Under this API, wouldn't we need to return `None` in
`desired_argument_types`? I think that this gives the coercer an
impossible problem: since the coercer does not know which types
`concat` can be coerced to (`[Large]Utf8`), it would need to scan through
all data types and use `valid_argument_types` as the predicate to select which
one to pick.
This would be even more challenging for `array`. There, we want the coercer
to coerce every argument to a type compatible with all the others (what Spark
does), but we only have access to a predicate, `valid_argument_types`. I can't
see how we could let the coercer know that it needs to coerce all arguments to
a common type without scanning all combinations of all argument types.
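A small sketch of the problem, assuming a predicate with roughly the semantics described above (both functions below are mine, not the PR's):
```
use arrow::datatypes::DataType;

// Stand-in for the predicate discussed above: concat-like, i.e. all
// arguments must be Utf8 or LargeUtf8, and all of the same type.
fn valid_argument_types(args: &[DataType]) -> bool {
    args.iter()
        .all(|t| matches!(t, DataType::Utf8 | DataType::LargeUtf8))
        && args.windows(2).all(|w| w[0] == w[1])
}

// With only the predicate available, the coercer can do no better than try
// candidate types one by one and keep the first that is accepted. For
// `array`, where the arguments must all end up in one common type, this
// turns into a search over all combinations of all argument types.
fn coerce_by_scanning(num_args: usize, candidates: &[DataType]) -> Option<Vec<DataType>> {
    candidates
        .iter()
        .map(|t| vec![t.clone(); num_args])
        .find(|args| valid_argument_types(args))
}
```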
One alternative is to use something like the `enum Signature` from #8080 ,
which helps the coercer focus on the set of valid types. The problem is that we
can no longer guarantee non-overlapping signatures, as the same name may have
two traits with compatible signatures (point A. above), and it is not so
easy to debug what is happening.
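For reference, a sketch of what such an enum could look like (the variant names are mine, not necessarily the ones in #8080):
```
use arrow::datatypes::DataType;

// The signature tells the coercer directly which types are valid, so it no
// longer needs to scan with a predicate.
enum Signature {
    // a fixed list of argument types
    Exact(Vec<DataType>),
    // any number of arguments, each drawn from this set of valid types
    // (what `concat` needs: Utf8 or LargeUtf8)
    Variadic(Vec<DataType>),
    // any number of arguments, all coerced to a single common type
    // (what `array` needs)
    VariadicEqual,
}
```
With an explicit set of valid types, the coercer only needs to walk the signature rather than the whole `DataType` space.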
#### C.
I think that the trait also needs the function's return-type function,
something like
```
fn return_type(arg_types: &[DataType]) -> Result<DataType>;
```
so that we can check `f(g(c1), c2)` based on `g`'s return type.
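A minimal sketch of what that could look like for `concat` (the error type and names here are illustrative): the planner first asks `g` for its return type and then feeds it into `f`'s type check.
```
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

// `concat`'s return type follows its (already coerced) argument types.
fn return_type(arg_types: &[DataType]) -> Result<DataType> {
    match arg_types.first() {
        Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8),
        Some(DataType::Utf8) => Ok(DataType::Utf8),
        _ => Err(ArrowError::ComputeError(
            "concat only supports Utf8 and LargeUtf8".to_string(),
        )),
    }
}
```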
#### D.
Wouldn't users still need to preserve the invariant themselves
that the `PhysicalExpr::evaluate` coming from `make_physical_expr` is
consistent with the function's declared incoming types? I think that they would
still need to go back and forth in the code to ensure that the `PhysicalExpr`
implementation matches `struct concat`'s implementation of `desired_return_type`,
like we also do today.
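To illustrate the invariant (everything below is a stand-in, not this PR's trait): nothing ties the type the trait declares to the type the kernel actually produces, so the two must be kept in sync by hand.
```
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::DataType;

struct Concat;

impl Concat {
    // the type declared here...
    fn desired_return_type(&self) -> DataType {
        DataType::Utf8
    }

    // ...must match the array built here, but the compiler cannot check it:
    // returning a LargeStringArray instead would only fail at runtime.
    fn evaluate(&self, _args: &[ArrayRef]) -> ArrayRef {
        Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef
    }
}
```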
#### E.
Currently, the interface to register a UDF is:
```
let my_add = ScalarFunction::new(
    "my_add",
    vec![DataType::Int32, DataType::Int32],
    DataType::Int32,
    myfunc,
);
ctx.register_udf(my_add);
```
Wouldn't this new approach mean that a user would be unable to declare a UDF
inline, i.e. that they would always need to implement the stateless trait? This is
not very relevant to SQL, but in the DataFrame API, being able to register
functions "on the fly" is quite neat, as it keeps a function's declaration closer to its
usage (in the code).
This inline style is particularly powerful when the user does not even access the
registry, as Spark supports (and there is nothing forbidding us from
supporting it too: we just defer the registration to a bit later).
An alternative that comes to mind to support all use-cases is to use 2
closures and 1 function signature:
```
ScalarFunction::new(
    name: String,
    function: ScalarUdf,
    return_type: ReturnType,
    signature: Signature,
)
```
where
```
pub type ScalarUdf = Arc<dyn Fn(&[ArrayRef]) -> Result<ArrayRef> + Send + Sync>;
pub type ReturnType = Arc<dyn Fn(&Vec<DataType>) -> Result<DataType> + Send + Sync>;
pub enum Signature { /* as in #8080 */ }
```
and offer some utility methods to register simpler functions (e.g.
`udf(name: String, fun: ScalarUdf, in: DataType, out: DataType)`).
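For example, such a helper could be a thin wrapper over the general constructor. This follows the proposal above and is not an existing API; I renamed `in`/`out` to `input`/`output` to avoid the Rust keyword, and `Signature::Exact` is the variant assumed in the sketch under B.
```
use std::sync::Arc;

use arrow::datatypes::DataType;

// Sketch of the utility, built on the aliases proposed above.
fn udf(name: String, fun: ScalarUdf, input: DataType, output: DataType) -> ScalarFunction {
    ScalarFunction::new(
        name,
        fun,
        // the return type is fixed, regardless of the argument types
        Arc::new(move |_arg_types: &Vec<DataType>| Ok(output.clone())),
        Signature::Exact(vec![input]),
    )
}
```
Because the kernel and the return type are plain closures, this also keeps the inline / DataFrame use case from above working: the `ScalarFunction` can be built right next to where it is used and registered later (or not at all).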
Final note: this whole discussion may be a bit too detailed for UDFs, and we
could instead offer a simpler interface that does not handle these complex cases.
However, this whole discussion applies in almost the same way to built-in
functions: the only difference is whether they have a `'static` lifetime or are part
of a registry. That is why I am so interested in all the "hard" cases.