westonpace commented on code in PR #14682:
URL: https://github.com/apache/arrow/pull/14682#discussion_r1086924083
##########
python/pyarrow/tests/test_udf.py:
##########
@@ -504,3 +504,112 @@ def test_input_lifetime(unary_func_fixture):
# Calling a UDF should not have kept `v` alive longer than required
v = None
assert proxy_pool.bytes_allocated() == 0
+
+
+def _record_batch_from_iters(schema, *iters):
+ arrays = [pa.array(list(v), type=schema[i].type)
+ for i, v in enumerate(iters)]
+ return pa.RecordBatch.from_arrays(arrays=arrays, schema=schema)
+
+
+def _record_batch_for_range(schema, n):
+ return _record_batch_from_iters(schema,
+ range(n, n + 10),
+ range(n + 1, n + 11))
+
+
+def make_udt_func(schema, batch_gen):
+ def udf_func(ctx):
+ class UDT:
+ def __init__(self):
+ self.caller = None
+
+ def __call__(self, ctx):
+ try:
+ if self.caller is None:
+ self.caller, ctx = batch_gen(ctx).send, None
+ batch = self.caller(ctx)
+ except StopIteration:
+ arrays = [pa.array([], type=field.type)
+ for field in schema]
+ batch = pa.RecordBatch.from_arrays(
+ arrays=arrays, schema=schema)
Review Comment:
> Regarding StopIteration, since it is handled by the UDT class, perhaps
what we'd want is to make (something like) this class part of the API so that
the user would only need to provide the generator.
Agreed. That would be a good solution. It can wait for a future PR if
needed.
> UDT arguments wouldn't be passed to the "init" nor to the "execute"
function but to the registered UDT function (like udf_func in the tester)
returned by the Python UDT maker (like make_udt_func in the tester). These
arguments, or values derived from them, can be saved on the returned UDT
instance.
First, to clarify, by "UDT arguments" I am referring to the input rows (i.e.
a UDT with non-null arity). I think that is what you are referring to as well.
I agree that these args will need to be passed to the equivalent of
`udf_func`. However, `udf_func` is currently called by the kernel's init
function.
In other words, if I add these prints:
```
// udf.cc CallTabularFunction
std::cout << "About to get executor" << std::endl;
ARROW_ASSIGN_OR_RAISE(auto func_exec,
GetFunctionExecutor(func_name, in_types, NULLPTR,
registry));
std::cout << "Retrieved executor" << std::endl;
// test_udf.py
def udf_func(ctx):
class UDT:
...
print("Creating UDT")
return UDT()
```
Then I see the output:
```
About to get executor
Creating UDT
Retrieved executor
```
The problem is that a kernel's init function does not accept arguments today
(e.g. you can't have a unary or binary init function). So you will need to
solve this problem when you're ready to create UDTs that have non-null arity.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]