westonpace commented on PR #12590:
URL: https://github.com/apache/arrow/pull/12590#issuecomment-1106822179
I tried to create an example test case where batch_length mattered. This
meant using these UDFs as a projection expression in a plan. I learned a lot
about how things are currently implemented and unfortunately ran into a number
of bugs.
I don't think it makes sense to expose input types array & scalar. This is
not how the majority of our C++ compute kernels are created today. It's also
introducing a lot of confusion to a python user who has, before now, never
really had to deal with these concepts. Instead I think all UDFs should be
registered with ValueDescr::ANY and we should just document that the function
might receive an array or it might receive a scalar. This will make a lot of
sense to the user when they look at how they call it.
```
pc.call_function('my_udf', [scalar_one, scalar_two])
# The above should feel exactly like...
my_udf(None, scalar_one, scalar_two)
pc.call_function('my_udf', [array_one, array_two])
# The above should feel exactly like...
my_udf(None, array_one, array_two)
```
This means that UDFs that need to handle arrays & scalars should look like:
```
def flexible_udf(context, value):
if isinstance(value, pa.Array):
flexible_udf_array(value)
else:
flexible_udf_scalar(value)
```
However, in the short term, it seems very unlikely that a UDF would ever be
passed a scalar. The only scalars the execution engine encounters today are
the four augmented columns. Plus, ARROW-16288 means these columns can't be
used for projection correctly anyways. In the long term, if I get my way on
ARROW-16289 ;), we might still never see a scalar. If there is concern for
this then I would be open to also forcing all python UDFs to register as
`ValueDescr::ARRAY`.
In the future, if we start using more specific kernels, then we can
introduce support for registering specific input types. We can easily keep
backwards compatibility by treating `in_types = { "my_input": int32() }` the
same as `in_types = { "my_input": InputType.any(int32()) }`. The only
potentially breaking change would be if we got rid of `ValueDescr::ANY`.
Although, even then, we could probably just register a scalar and an array
version of the UDF and the UDF would have a redundant check in it.
Test-wise, I think we should update all of these tests to apply UDFs via
projection and not via `call_function`. A user shouldn't every really need to
use `call_function` with a UDF and so it makes more sense to test the path that
will be used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]