[GitHub] [arrow] icexelloss commented on a diff in pull request #35514: GH-35515: [C++][Python] Add non decomposable aggregation UDF

via GitHub Wed, 07 Jun 2023 07:24:04 -0700


icexelloss commented on code in PR #35514:
URL: https://github.com/apache/arrow/pull/35514#discussion_r1221691124



##########
python/pyarrow/src/arrow/python/udf.cc:
##########
@@ -234,6 +351,56 @@ Status RegisterTabularFunction(PyObject* user_function, 
UdfWrapperCallback wrapp
       wrapper, options, registry);
 }
 
+Status AddAggKernel(std::shared_ptr<compute::KernelSignature> sig, 
compute::KernelInit init,
+                           compute::ScalarAggregateFunction* func) {
+
+  compute::ScalarAggregateKernel kernel(std::move(sig), std::move(init), 
AggregateUdfConsume, AggregateUdfMerge, AggregateUdfFinalize, 
/*ordered=*/false);
+  RETURN_NOT_OK(func->AddKernel(std::move(kernel)));
+  return Status::OK();
+}
+
+Status RegisterAggregateFunction(PyObject* agg_function, UdfWrapperCallback 
agg_wrapper,
+                                               const UdfOptions& options,
+                                               compute::FunctionRegistry* 
registry) {
+  if (!PyCallable_Check(agg_function)) {
+    return Status::TypeError("Expected a callable Python object.");
+  }
+
+  if (registry == NULLPTR) {
+    registry = compute::GetFunctionRegistry();
+  }
+
+  static auto default_scalar_aggregate_options = 
compute::ScalarAggregateOptions::Defaults();
+  auto aggregate_func = std::make_shared<compute::ScalarAggregateFunction>(
+      options.func_name, options.arity, options.func_doc, 
&default_scalar_aggregate_options);
+
+  Py_INCREF(agg_function);

Review Comment:
   I looked into this more and observed the following.
   
   The refcount of Python function is 2 before registering with Arrow, and is 3 
after registering, this line increases the refcount from 2->3 which makes sense 
to be because we do not want the PyOjbect* to be gced if all Python refs are 
lost (since it is stilled referenced/registered with Arrow). After invoking the 
function with arrow compute, the refcount doesn't increase. (still 3) So I 
believe the code is correct.
   
   On the other hand, I did observe scalar UDF might have a refcount leaking (I 
observed the refcount increases by 1 after calling the function with arrow 
compute). Which I can look into separately.
   
   In summary, my understanding is that every time we wrap the PyObject* into a 
OwnedRef, we need to increase the refcount by 1 so the refcount doesn't change 
after the destructor of OwnedRef gets invoked. I am 90% certain this is correct 
understanding of how it should work but happy to hear more thoughts @westonpace 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] icexelloss commented on a diff in pull request #35514: GH-35515: [C++][Python] Add non decomposable aggregation UDF

Reply via email to