[GitHub] [arrow] westonpace commented on a diff in pull request #14682: GH-32916: [C++] [Python] User-defined tabular functions

via GitHub Wed, 25 Jan 2023 09:12:59 -0800


westonpace commented on code in PR #14682:
URL: https://github.com/apache/arrow/pull/14682#discussion_r1086924083



##########
python/pyarrow/tests/test_udf.py:
##########
@@ -504,3 +504,112 @@ def test_input_lifetime(unary_func_fixture):
     # Calling a UDF should not have kept `v` alive longer than required
     v = None
     assert proxy_pool.bytes_allocated() == 0
+
+
+def _record_batch_from_iters(schema, *iters):
+    arrays = [pa.array(list(v), type=schema[i].type)
+              for i, v in enumerate(iters)]
+    return pa.RecordBatch.from_arrays(arrays=arrays, schema=schema)
+
+
+def _record_batch_for_range(schema, n):
+    return _record_batch_from_iters(schema,
+                                    range(n, n + 10),
+                                    range(n + 1, n + 11))
+
+
+def make_udt_func(schema, batch_gen):
+    def udf_func(ctx):
+        class UDT:
+            def __init__(self):
+                self.caller = None
+
+            def __call__(self, ctx):
+                try:
+                    if self.caller is None:
+                        self.caller, ctx = batch_gen(ctx).send, None
+                    batch = self.caller(ctx)
+                except StopIteration:
+                    arrays = [pa.array([], type=field.type)
+                              for field in schema]
+                    batch = pa.RecordBatch.from_arrays(
+                        arrays=arrays, schema=schema)

Review Comment:
   > Regarding StopIteration, since it is handled by the UDT class, perhaps 
what we'd want is to make (something like) this class part of the API so that 
the user would only need to provide the generator.
   
   Agreed.  That would be a good solution.  It can wait for a future PR if 
needed.
   
   > UDT arguments wouldn't be passed to the "init" nor to the "execute" 
function but to the registered UDT function (like udf_func in the tester) 
returned by the Python UDT maker (like make_udt_func in the tester). These 
arguments, or values derived from them, can be saved on the returned UDT 
instance.
   
   First, to clarify, by "UDT arguments" I am referring to the input rows (i.e. 
a UDT with non-null arity).  I think that is what you are referring to as well. 
 I agree that these args will need to be passed to the equivalent of 
`udf_func`.  However, `udf_func` is currently called by the kernel's init 
function.
   
   In other words, if I add these prints:
   
   ```
   // udf.cc CallTabularFunction
     std::cout << "About to get executor" << std::endl;
     ARROW_ASSIGN_OR_RAISE(auto func_exec,
                           GetFunctionExecutor(func_name, in_types, NULLPTR, 
registry));
     std::cout << "Retrieved executor" << std::endl;
   
   // test_udf.py
       def udf_func(ctx):
           class UDT:
               ...
           print("Creating UDT")
           return UDT()
   ```
   
   Then I see the output:
   
   ```
   About to get executor
   Creating UDT
   Retrieved executor
   ```
   
   The problem is that a kernel's init function does not accept arguments today 
(e.g. you can't have a unary or binary init function).  So you will need to 
solve this problem when you're ready to create UDTs that have non-null arity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #14682: GH-32916: [C++] [Python] User-defined tabular functions

Reply via email to