westonpace commented on PR #14320:
URL: https://github.com/apache/arrow/pull/14320#issuecomment-1371227698
> 2. One cannot register different implementations for different input
types. This might be an annoying limitation (and/or a performance issue if
people work around it by checking types in Python).
Assuming the actual UDF implementation is in C, I think the "python UDF" is
just mapping to the appropriate external library call, and the external library
will handle the type checking itself. For example, I know we had an example at
one point where we registered numpy's `gcd` function as a UDF. Since numpy
does its own type dispatch, we could use the same kernel for all integral input
types.
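As a rough illustration of the numpy point, a single callable can serve every integral type because `np.gcd` dispatches on dtype internally. Plain numpy arrays stand in for Arrow arrays here, and `gcd_kernel` with its unused `ctx` argument is a hypothetical name, not the example that was actually registered:

```python
import numpy as np


def gcd_kernel(ctx, x, y):
    # ctx is unused; it is kept only to mirror the (ctx, *args) UDF call shape.
    # np.gcd picks the right integer loop from the input dtypes on its own.
    return np.gcd(x, y)


# The same callable is reused, unchanged, for several integer widths.
results = {}
for dtype in (np.int8, np.int32, np.int64):
    x = np.array([12, 18], dtype=dtype)
    y = np.array([8, 27], dtype=dtype)
    results[np.dtype(dtype).name] = gcd_kernel(None, x, y)
```

No type checks are needed in the Python layer; the output dtype follows the input dtype.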
If the actual UDF implementation is in Python, then I think performance
doesn't matter in this context (e.g. prototyping). Users are probably mapping
with something like `to_pylist`, so one callable would be able to handle all
numeric types (e.g. in C we have `int32_t` and `float`, but in Python those
would both just be `number`).
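A minimal sketch of that second case, using plain lists to stand in for what `to_pylist` returns (`times_two` is a hypothetical name chosen for illustration):

```python
from numbers import Number


def times_two(values):
    """Double every non-null element of a list such as `to_pylist` produces."""
    out = []
    for v in values:
        if v is None:
            # Nulls come back from to_pylist as None; pass them through.
            out.append(None)
        elif isinstance(v, Number):
            # int and float both satisfy Number, so one branch covers both.
            out.append(v * 2)
        else:
            raise TypeError(f"unsupported element type: {type(v).__name__}")
    return out


int_result = times_two([1, 2, None])
float_result = times_two([1.5, None, 2.25])
```

The callable never looks at the Arrow type, which is why a single registration could cover all numeric inputs in the prototyping case.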
> 1. The API is ugly.
I'm not sure how best to address this. If I were to start from scratch, I think
I'd end up with something a bit more complex but hopefully user friendly:
```
class Udf(object):
    """
    A user defined function that can be registered with pyarrow so that
    it can be used in expression evaluation.

    Parameters
    ----------
    name : str
        The name of the function, function names must be unique
    doc : UdfDocumentation
        Optional object describing the function.  Can be used in
        interactive contexts to provide help to a user creating expressions
    kernels : List[UdfKernel]
        One or more kernels which map combinations of input types
        to python callables that provide the function implementation
    """


class UdfDocumentation(object):
    """
    Documentation that describes a user defined function.  This can be
    displayed as helpful information to users that are authoring compute
    expressions interactively.

    Parameters
    ----------
    summary : str, optional
        A brief summary of the function, defaults to the function name
    description : str, optional
        An extended description that can be used to describe function
        details.  Defaults to an empty string.
    arg_names : List[str], optional
        Names to give the arguments.  Defaults to ["arg0", "arg1", ...]
    """


class UdfKernel(object):
    """
    A mapping from input types to a python callable.  The same python
    callable can be used in multiple `UdfKernel` instances if the callable
    can handle arrays of different types.

    During execution of an expression Arrow will pick the appropriate
    kernel based on the types of the arguments.  A kernel will only be
    selected if the kernel's input types match the argument types exactly.

    All kernels within a Udf must have the same number of input types.

    Parameters
    ----------
    input_types : List[DataType]
        The input types expected by `exec`
    output_type : DataType
        The output type of the array returned by the `exec`
    exec : Callable
        The python callable to execute when this kernel is invoked
        <description of how the callable will be invoked>
    """
```
The example would then become:
```
import pyarrow as pa
import pyarrow.compute as pc

# Udf, UdfKernel and UdfDocumentation are the classes proposed above;
# they are part of this proposal, not an existing pyarrow API.


def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)

    func_name = "y=x*2"
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    # One kernel per input type, all sharing the same callable
    kernels = [UdfKernel([in_type], in_type, unary_function)
               for in_type in input_types]
    func_doc = UdfDocumentation("multiply by two function",
                                "test multiply function", ["array"])
    udf = Udf(func_name, kernels, doc=func_doc)
    pc.register_scalar_function(udf)
```
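For what it's worth, the exact-match kernel selection described in the `UdfKernel` docstring could be sketched roughly as follows. This is only an illustration of the rule: `KernelSketch` and `select_kernel` are hypothetical names, and type names are plain strings rather than Arrow `DataType` objects.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class KernelSketch:
    # Hypothetical stand-in for the proposed UdfKernel.
    input_types: Tuple[str, ...]
    output_type: str
    exec: Callable


def select_kernel(kernels: Sequence[KernelSketch], arg_types: Sequence[str]):
    """Return the first kernel whose input types equal arg_types exactly."""
    wanted = tuple(arg_types)
    for kernel in kernels:
        if kernel.input_types == wanted:
            return kernel
    # No implicit casting: a near miss (e.g. int8 vs int32) is an error.
    raise TypeError(f"no kernel matches input types {wanted}")


def double(ctx, x):
    return [v * 2 for v in x]


# The same callable backs several kernels, one per supported type.
kernels = [KernelSketch((t,), t, double) for t in ("int32", "float64")]
chosen = select_kernel(kernels, ["int32"])
```

With exact matching there is no ambiguity about which kernel runs, at the cost of having to list every supported type explicitly, as the list comprehension in the example above does.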
@pitrou would that be a more favorable API or can you give more description
of what you would like to change?