westonpace commented on code in PR #36253:
URL: https://github.com/apache/arrow/pull/36253#discussion_r1245568309
##########
python/pyarrow/src/arrow/python/udf.cc:
##########
@@ -215,9 +247,164 @@ struct PythonUdfScalarAggregatorImpl : public ScalarUdfAggregator {
return Status::OK();
}
- UdfWrapperCallback agg_cb;
+ std::shared_ptr<OwnedRefNoGIL> function;
+ UdfWrapperCallback cb;
std::vector<std::shared_ptr<RecordBatch>> values;
- std::shared_ptr<OwnedRefNoGIL> agg_function;
+ std::shared_ptr<Schema> input_schema;
+ std::shared_ptr<DataType> output_type;
+};
+
+struct PythonUdfHashAggregatorImpl : public HashUdfAggregator {
+ PythonUdfHashAggregatorImpl(std::shared_ptr<OwnedRefNoGIL> function,
+ UdfWrapperCallback cb,
+ std::vector<std::shared_ptr<DataType>> input_types,
+ std::shared_ptr<DataType> output_type)
+ : function(function), cb(std::move(cb)), output_type(std::move(output_type)) {
+ Py_INCREF(function->obj());
Review Comment:
These `INCREF`s still seem superfluous to me, but I don't think it's
critical. We could test in a follow-up, using temporary function registries,
to see whether we are preventing UDF functions from being garbage collected.
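A follow-up test along these lines could use `weakref` from the standard library to observe whether a function object is actually collectable once the registry drops it. This is only a minimal sketch of the idea; `make_udf`, `my_udf`, and the `registry` dict are hypothetical stand-ins for a pyarrow function registry, and the extra strong reference in the dict plays the role of an unnecessary `Py_INCREF`:

```python
import gc
import weakref

def make_udf():
    # Hypothetical stand-in for a user-defined aggregate function.
    def my_udf(ctx, values):
        return values
    return my_udf

udf = make_udf()
ref = weakref.ref(udf)

# Simulate a registry holding an extra strong reference, as a
# superfluous Py_INCREF would.
registry = {"my_udf": udf}

del udf
gc.collect()
assert ref() is not None  # still alive: the registry keeps it reachable

registry.clear()
gc.collect()
assert ref() is None  # collected once the last strong reference is gone
```

If the real registry (or a leaked `INCREF`) keeps the function pinned, the final assertion would fail, which is exactly the signal such a follow-up test would look for.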
##########
python/pyarrow/_compute.pyx:
##########
@@ -2767,6 +2767,9 @@ def register_aggregate_function(func, function_name, function_doc, in_types, out
This is often used with ordered or segmented aggregation where groups
can be emit before accumulating all of the input data.
+ Note that currently size of any input column can not exceed 2 GB limit
+ (all groups combined).
Review Comment:
```suggestion
    Note that currently the size of any input column cannot exceed 2 GB
    for a single segment (all groups combined).
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]