westonpace commented on PR #12590:
URL: https://github.com/apache/arrow/pull/12590#issuecomment-1106822179

   I tried to create an example test case where batch_length mattered.  This 
meant using these UDFs as a projection expression in a plan.  I learned a lot 
about how things are currently implemented and unfortunately ran into a number 
of bugs.
   
   I don't think it makes sense to expose input types array & scalar.  This is 
not how the majority of our C++ compute kernels are created today.  It's also 
introducing a lot of confusion to a python user who has, before now, never 
really had to deal with these concepts.  Instead I think all UDFs should be 
registered with ValueDescr::ANY and we should just document that the function 
might receive an array or it might receive a scalar.  This will make a lot of 
sense to the user when they look at how they call it.
   
   ```
   pc.call_function('my_udf', [scalar_one, scalar_two])
   # The above should feel exactly like...
   my_udf(None, scalar_one, scalar_two)
   
   pc.call_function('my_udf', [array_one, array_two])
   # The above should feel exactly like...
   my_udf(None, array_one, array_two)
   ```
   
   This means that UDFs that need to handle arrays & scalars should look like:
   
   ```
   def flexible_udf(context, value):
     if isinstance(value, pa.Array):
       flexible_udf_array(value)
     else:
       flexible_udf_scalar(value)
   ```
   
   However, in the short term, it seems very unlikely that a UDF would ever be 
passed a scalar.  The only scalars the execution engine encounters today are 
the four augmented columns.  Plus, ARROW-16288 means these columns can't be 
used for projection correctly anyways.  In the long term, if I get my way on 
ARROW-16289 ;), we might still never see a scalar.  If there is concern for 
this then I would be open to also forcing all python UDFs to register as 
`ValueDescr::ARRAY`.
   
   In the future, if we start using more specific kernels, then we can 
introduce support for registering specific input types.  We can easily keep 
backwards compatibility by treating `in_types = { "my_input": int32() }` the 
same as `in_types = { "my_input": InputType.any(int32()) }`.  The only 
potentially breaking change would be if we got rid of `ValueDescr::ANY`.  
Although, even then, we could probably just register a scalar and an array 
version of the UDF and the UDF would have a redundant check in it.
   
   Test-wise, I think we should update all of these tests to apply UDFs via 
projection and not via `call_function`.  A user shouldn't every really need to 
use `call_function` with a UDF and so it makes more sense to test the path that 
will be used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to