westonpace commented on PR #14320:
URL: https://github.com/apache/arrow/pull/14320#issuecomment-1371227698

   >     2. One cannot register different implementations for different input 
types. This might be an annoying limitation  (and/or a performance issue if 
people work around it by checking types in Python).
   
   Assuming the actual UDF implementation is in C I think the "python UDF" is 
just mapping to the appropriate external library call and the external library 
will handle the type checking itself.  For example, I know we had an example at 
one point where we registered numpy's `gcd` function as a UDF.  Since numpy 
does its own type dispatch we could use the same kernel for all integral input 
types.
   
   If the actual UDF implementation is in Python then I think performance 
doesn't matter in this context (e.g. prototyping).  The author is probably 
converting with something like `to_pylist`, so one callable would be able to 
handle all numeric types (e.g. in C we have `int32_t` and `float` but in Python 
those would both just be numbers).
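
To make that point concrete, here is a minimal sketch using only plain Python lists. The lists stand in for what `to_pylist` would return; the function name `double_values` and the sample data are made up for illustration and are not part of the PR.

```python
def double_values(values):
    """Double every element of a plain Python list, passing nulls (None) through."""
    return [None if v is None else v * 2 for v in values]


# Hypothetical stand-ins for the lists `to_pylist` would produce from an
# integer array and a floating point array.
int_values = [1, 2, None]
float_values = [0.5, None, 1.5]

# The same callable handles both, because Python does not distinguish
# int32 from int64 and `*` works for ints and floats alike.
doubled_ints = double_values(int_values)
doubled_floats = double_values(float_values)
```

The callable never inspects the element type, which is why a single registration would be enough in the prototyping case.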
   
   >     1. The API is ugly.
   
   I'm not sure how best to address this.  If I were starting from scratch I 
think I'd end up with something a bit more complex but hopefully user friendly:
   
   ```
   class Udf(object):
       """
       A user defined function that can be registered with pyarrow so that
       it can be used in expression evaluation.
   
       Parameters
       ----------
       name : str
           The name of the function.  Function names must be unique.
       doc : UdfDocumentation, optional
           Object describing the function.  Can be used in interactive
           contexts to provide help to a user creating expressions.
       kernels : List[UdfKernel]
           One or more kernels which map combinations of input types
           to python callables that provide the function implementation.
       """
   
   class UdfDocumentation(object):
       """
       Documentation that describes a user defined function.  This can be
       displayed as helpful information to users that are authoring compute
       expressions interactively.
   
       Parameters
       ----------
       summary : str, optional
           A brief summary of the function.  Defaults to the function name.
       description : str, optional
           An extended description that can be used to describe function
           details.  Defaults to an empty string.
       arg_names : List[str], optional
           Names to give the arguments.  Defaults to ["arg0", "arg1", ...]
       """
   
   class UdfKernel(object):
       """
       A mapping from input types to a python callable.  The same python
       callable can be used in multiple `UdfKernel` instances if the callable
       can handle arrays of different types.
   
       During execution of an expression Arrow will pick the appropriate
       kernel based on the types of the arguments.  A kernel will only be
       selected if the kernel's input types match the argument types exactly.
   
       All kernels within a Udf must have the same number of input types.
   
       Parameters
       ----------
       input_types : List[DataType]
           The input types expected by `exec`
       output_type : DataType
           The output type of the array returned by `exec`
       exec : Callable
           The python callable to execute when this kernel is invoked
           <description of how the callable will be invoked>
       """
   ```
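
The exact-match selection rule described in the `UdfKernel` docstring could be sketched roughly as follows. This is not pyarrow code: `KernelSketch`, `UdfSketch` and `dispatch` are hypothetical names, and data types are represented as plain strings so that the example runs without Arrow installed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class KernelSketch:
    """Hypothetical stand-in for the proposed UdfKernel."""
    input_types: Tuple[str, ...]
    output_type: str
    exec: Callable


class UdfSketch:
    """Hypothetical stand-in for the proposed Udf, reduced to kernel lookup."""

    def __init__(self, name: str, kernels: Sequence[KernelSketch]):
        if not kernels:
            raise ValueError("at least one kernel is required")
        # All kernels within a Udf must have the same number of input types.
        if len({len(k.input_types) for k in kernels}) != 1:
            raise ValueError("all kernels must have the same number of input types")
        self.name = name
        self._by_types = {k.input_types: k for k in kernels}

    def dispatch(self, arg_types: Sequence[str]) -> KernelSketch:
        # Exact match only: no implicit casts or type promotion.
        try:
            return self._by_types[tuple(arg_types)]
        except KeyError:
            raise TypeError(
                f"{self.name}: no kernel for input types {tuple(arg_types)}"
            ) from None


def double(x):
    return [v * 2 for v in x]


# One callable shared by several kernels, as the docstring allows.
udf = UdfSketch(
    "times_two",
    [KernelSketch((t,), t, double) for t in ("int32", "float64")],
)
selected = udf.dispatch(["int32"])
result = selected.exec([1, 2, 3])
```

Keying the lookup on the full tuple of input types keeps selection a single dictionary access. The cost is that a caller passing `int64` gets an error instead of an implicit cast, which is the behaviour the docstring asks for.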
   
   The example would then become:
   
   ```
   def test_multi_kernel_registration():
       def unary_function(ctx, x):
           return pc.cast(pc.call_function("multiply", [x, 2],
                                           memory_pool=ctx.memory_pool), x.type)
   
       func_name = "y=x*2"
       input_types = [
           pa.int8(),
           pa.int16(),
           pa.int32(),
           pa.int64(),
           pa.float32(),
           pa.float64()
       ]
       # One kernel per input type; every kernel reuses the same callable.
       kernels = [UdfKernel([in_type], in_type, unary_function)
                  for in_type in input_types]
       func_doc = UdfDocumentation("multiply by two function",
                                   "test multiply function", ["array"])
       udf = Udf(func_name, kernels, doc=func_doc)
       pc.register_scalar_function(udf)
   ```
   
   @pitrou would that be a more favorable API or can you give more description 
of what you would like to change?


