jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683467212
Thanks a lot @alamb for sharing that. I now see what you mean.
I was not able to fully convince myself, likely because I did not fully
understand everything. So maybe you can help me; below is my reasoning so far:
#### A.
This introduces an interesting feature: a UDF is no longer uniquely
identifiable by its name. If a user registers two functions under the same name
(like `sqrt_32` and `sqrt_64` above), both become part of the registry,
and it is up to the coercer to decide which one to use based on the argument types.
On the one hand, this allows registering variants of a function,
which makes it possible to extend an existing UDF; OTOH, it may not be very intuitive that
a single function "sqrt" can be registered multiple times, with its
signature and the coercer's logic deciding which variant is used.
In particular, the registry can't warn or error if the user tries to
re-register a function with the same name: it must assume that every new
registration is a potential new variant that the coercer may pick up.
Wouldn't this also mean that users would need to ensure that they do not
register 2 traits with the same name and a compatible `desired_return_type`? I
can imagine a situation where the coercer returns non-deterministic plans
because two variants are valid and one is picked by chance, or where the
order in which functions with compatible signatures are registered dictates which
one is used.
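To make this concrete, here is a rough sketch of the situation I have in mind. The trait, registry, and method names below are stand-ins I made up for illustration, not this PR's API:
```
use std::collections::HashMap;

use arrow::datatypes::DataType;

// Stand-in for the stateless trait discussed in this PR.
trait ScalarUdfVariant {
    fn name(&self) -> &str;
    // true if this variant accepts the given argument types
    fn valid_argument_types(&self, args: &[DataType]) -> bool;
}

#[derive(Default)]
struct Registry {
    udfs: HashMap<String, Vec<Box<dyn ScalarUdfVariant>>>,
}

impl Registry {
    // Registration cannot reject a duplicate name: it must assume the new
    // entry is another variant that the coercer may pick up later.
    fn register(&mut self, udf: Box<dyn ScalarUdfVariant>) {
        let name = udf.name().to_string();
        self.udfs.entry(name).or_default().push(udf);
    }

    // If two variants accept the same argument types, the first one
    // registered wins -- the order dependence described above.
    fn resolve(&self, name: &str, args: &[DataType]) -> Option<&dyn ScalarUdfVariant> {
        self.udfs
            .get(name)?
            .iter()
            .find(|v| v.valid_argument_types(args))
            .map(|v| v.as_ref())
    }
}
```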
#### B.
`concat::desired_argument_types` returns `[None, None]`.
I thought that we wanted `concat` to:
1. accept only `LargeUTF8` or `UTF8`
2. accept an arbitrary number of arguments (or the user needs to write/see
in the plan `concat(concat(concat(c1, c2), "-"), c4)` instead of `concat(c1,
c2, "-", c4)`)
Under this API, wouldn't we need to return `None` in
`desired_argument_types`? I think that this gives the coercer an
impossible problem: since the coercer does not know which types
`concat` can be coerced to (`[Large]Utf8`), it would need to scan through
all data types and use `valid_argument_types` as the predicate to select which
one to pick.
This would be even more challenging for `array`. There, we want the coercer
to coerce every argument to a type compatible with all the others (what Spark
does), but we only have access to a predicate, `valid_argument_types`. I can't
see how we could let the coercer know that it needs to coerce all arguments to
a common type without scanning all combinations of all argument types.
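A small sketch of the problem, assuming a predicate with roughly the semantics described above (both functions below are mine, not the PR's):
```
use arrow::datatypes::DataType;

// Stand-in for the predicate discussed above: concat-like, i.e. all
// arguments must be Utf8 or LargeUtf8, and all of the same type.
fn valid_argument_types(args: &[DataType]) -> bool {
    args.iter()
        .all(|t| matches!(t, DataType::Utf8 | DataType::LargeUtf8))
        && args.windows(2).all(|w| w[0] == w[1])
}

// With only the predicate available, the coercer can do no better than try
// candidate types one by one and keep the first that is accepted. For
// `array`, where the arguments must all end up in one common type, this
// turns into a search over all combinations of all argument types.
fn coerce_by_scanning(num_args: usize, candidates: &[DataType]) -> Option<Vec<DataType>> {
    candidates
        .iter()
        .map(|t| vec![t.clone(); num_args])
        .find(|args| valid_argument_types(args))
}
```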
One alternative is to use something like the `enum Signature` from #8080 ,
which helps the coercer focus on the set of valid types. The problem is that we
can no longer guarantee non-overlapping signatures, as the same name may have
two traits with compatible signatures (point A. above), and it is not so
easy to debug what is happening.
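For reference, a sketch of what such an enum could look like (the variant names are mine, not necessarily the ones in #8080):
```
use arrow::datatypes::DataType;

// The signature tells the coercer directly which types are valid, so it no
// longer needs to scan with a predicate.
enum Signature {
    // a fixed list of argument types
    Exact(Vec<DataType>),
    // any number of arguments, each drawn from this set of valid types
    // (what `concat` needs: Utf8 or LargeUtf8)
    Variadic(Vec<DataType>),
    // any number of arguments, all coerced to a single common type
    // (what `array` needs)
    VariadicEqual,
}
```
With an explicit set of valid types, the coercer only needs to walk the signature rather than the whole `DataType` space.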
#### C.
I think that the trait also needs the function's return-type function,
something like
```
fn return_type(arg_types: &[DataType]) -> Result<DataType>;
```
so that we can check `f(g(c1), c2)` based on `g`'s return type.
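A minimal sketch of what that could look like for `concat` (the error type and names here are illustrative): the planner first asks `g` for its return type and then feeds it into `f`'s type check.
```
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

// `concat`'s return type follows its (already coerced) argument types.
fn return_type(arg_types: &[DataType]) -> Result<DataType> {
    match arg_types.first() {
        Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8),
        Some(DataType::Utf8) => Ok(DataType::Utf8),
        _ => Err(ArrowError::ComputeError(
            "concat only supports Utf8 and LargeUtf8".to_string(),
        )),
    }
}
```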
#### D.
Wouldn't users still need to preserve the invariant themselves
that the `PhysicalExpr::evaluate` coming from `make_physical_expr` is
consistent with the function's declared incoming types? I think that they would
still need to go back and forth in the code to ensure that the `PhysicalExpr`
implementation matches `struct concat`'s implementation of `desired_return_type`,
like we also do today.
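To illustrate the invariant (everything below is a stand-in, not this PR's trait): nothing ties the type the trait declares to the type the kernel actually produces, so the two must be kept in sync by hand.
```
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::DataType;

struct Concat;

impl Concat {
    // the type declared here...
    fn desired_return_type(&self) -> DataType {
        DataType::Utf8
    }

    // ...must match the array built here, but the compiler cannot check it:
    // returning a LargeStringArray instead would only fail at runtime.
    fn evaluate(&self, _args: &[ArrayRef]) -> ArrayRef {
        Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef
    }
}
```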
#### E.
Currently, the interface to register a UDF is:
```
let my_add = ScalarFunction::new(
    "my_add",
    vec![DataType::Int32, DataType::Int32],
    DataType::Int32,
    myfunc,
);
ctx.register_udf(my_add);
```
Wouldn't this new approach mean that a user would be unable to declare a UDF
inline, i.e. that they would always need to implement the stateless trait? This is
not very relevant to SQL, but in the DataFrame API, being able to register
functions "on the fly" is quite neat, as it keeps a function's declaration closer to its
usage (in the code).
This inline style is particularly powerful when the user does not even access the
registry, as Spark supports (and there is nothing forbidding us from
supporting it too: we just defer the registration to a bit later).
An alternative that comes to mind to support all use-cases is to use 2
closures and 1 function signature:
```
ScalarFunction::new(
    name: String,
    function: ScalarUdf,
    return_type: ReturnType,
    signature: Signature,
)
```
where
```
pub type ScalarUdf = Arc<dyn Fn(&[ArrayRef]) -> Result<ArrayRef> + Send + Sync>;
pub type ReturnType = Arc<dyn Fn(&Vec<DataType>) -> Result<DataType> + Send + Sync>;
pub enum Signature { /* as in #8080 */ }
```
and offer some utility methods to register simpler functions (e.g.
`udf(name: String, fun: ScalarUdf, in: DataType, out: DataType)`).
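For example, such a helper could be a thin wrapper over the general constructor. This follows the proposal above and is not an existing API; I renamed `in`/`out` to `input`/`output` to avoid the Rust keyword, and `Signature::Exact` is the variant assumed in the sketch under B.
```
use std::sync::Arc;

use arrow::datatypes::DataType;

// Sketch of the utility, built on the aliases proposed above.
fn udf(name: String, fun: ScalarUdf, input: DataType, output: DataType) -> ScalarFunction {
    ScalarFunction::new(
        name,
        fun,
        // the return type is fixed, regardless of the argument types
        Arc::new(move |_arg_types: &Vec<DataType>| Ok(output.clone())),
        Signature::Exact(vec![input]),
    )
}
```
Because the kernel and the return type are plain closures, this also keeps the inline / DataFrame use case from above working: the `ScalarFunction` can be built right next to where it is used and registered later (or not at all).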
Final note: this whole discussion may be a bit too detailed for UDFs, and we
could instead offer a simpler interface that does not handle these complex cases.
However, this whole discussion applies in almost the same way to built-in
functions: the only difference is whether they have a `'static` lifetime or are part
of a registry. That is why I am so interested in all the "hard" cases.