jorisvandenbossche commented on code in PR #15083:
URL: https://github.com/apache/arrow/pull/15083#discussion_r1073518753


##########
python/pyarrow/table.pxi:
##########
@@ -5358,36 +5358,45 @@ list[tuple(str, str, FunctionOptions)]
         ----
         values_sum: [[3,7,5]]
         keys: [["a","b","c"]]
+        >>> t.group_by("keys").aggregate([([], "count_all")])
+        pyarrow.Table
+        _count_all: int64
+        keys: string
+        ----
+        _count_all: [[2,2,1]]
+        keys: [["a","b","c"]]
         >>> t.group_by("keys").aggregate([])
         pyarrow.Table
         keys: string
         ----
         keys: [["a","b","c"]]
         """
-        columns = [a[0] for a in aggregations]
+        target_cols = [a[0] if isinstance(a[0], (list, tuple)) else [
+            a[0]] for a in aggregations]

Review Comment:
   Why is this change needed? Because now the target cols are always expected 
to be a list?
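   For context, a minimal sketch of the normalization the new comprehension performs (the helper name is hypothetical, not from the PR): a plain string target becomes a one-element list, while lists/tuples pass through as lists.

   ```python
   def normalize_target(target):
       # Wrap a single column name in a list; leave lists/tuples as lists.
       # (Hypothetical helper illustrating the comprehension in the diff.)
       if isinstance(target, (list, tuple)):
           return list(target)
       return [target]
   ```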



##########
python/pyarrow/table.pxi:
##########
@@ -5358,36 +5358,45 @@ list[tuple(str, str, FunctionOptions)]
         ----
         values_sum: [[3,7,5]]
         keys: [["a","b","c"]]
+        >>> t.group_by("keys").aggregate([([], "count_all")])
+        pyarrow.Table
+        _count_all: int64
+        keys: string
+        ----
+        _count_all: [[2,2,1]]
+        keys: [["a","b","c"]]
         >>> t.group_by("keys").aggregate([])
         pyarrow.Table
         keys: string
         ----
         keys: [["a","b","c"]]
         """
-        columns = [a[0] for a in aggregations]
+        target_cols = [a[0] if isinstance(a[0], (list, tuple)) else [
+            a[0]] for a in aggregations]

Review Comment:
   Looking further at the changes below, maybe we can combine all this 
extraction/validation into the single `for aggr in aggrfuncs:` loop below?  
   That might give more readable code (compared to the several list 
comprehensions), something like:
   
   ```
   group_by_aggrs = []
   for aggr in aggregations:
       if len(aggr) == 2:
           target, func = aggr
           opt = None
       else:
           target, func, opt = aggr
       # check func name ...
       # convert target to list if needed ...
       group_by_aggrs.append((target, func, opt))
   ```
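   Fleshed out into a runnable sketch (names follow the snippet above; the "check func name" step shown here is the `hash_` prefixing the diff does in a later loop, and the function name is hypothetical):

   ```python
   def parse_aggregations(aggregations):
       # Combined extraction/validation in a single loop, as suggested
       # above, instead of several list comprehensions.
       group_by_aggrs = []
       for aggr in aggregations:
           if len(aggr) == 2:
               target, func = aggr
               opt = None
           else:
               target, func, opt = aggr
           # Convert a single target column name to a list if needed.
           if not isinstance(target, (list, tuple)):
               target = [target]
           # Ensure the hash-aggregation kernel name prefix.
           if not func.startswith("hash_"):
               func = "hash_" + func
           group_by_aggrs.append((list(target), func, opt))
       return group_by_aggrs
   ```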



##########
python/pyarrow/table.pxi:
##########
@@ -5358,36 +5358,45 @@ list[tuple(str, str, FunctionOptions)]
         ----
         values_sum: [[3,7,5]]
         keys: [["a","b","c"]]
+        >>> t.group_by("keys").aggregate([([], "count_all")])
+        pyarrow.Table
+        _count_all: int64
+        keys: string
+        ----
+        _count_all: [[2,2,1]]
+        keys: [["a","b","c"]]
         >>> t.group_by("keys").aggregate([])
         pyarrow.Table
         keys: string
         ----
         keys: [["a","b","c"]]
         """
-        columns = [a[0] for a in aggregations]
+        target_cols = [a[0] if isinstance(a[0], (list, tuple)) else [
+            a[0]] for a in aggregations]
         aggrfuncs = [
-            (a[1], a[2]) if len(a) > 2 else (a[1], None)
-            for a in aggregations
+            (target, a[1], a[2]) if len(a) > 2 else (target, a[1], None)
+            for (target, a) in zip(target_cols, aggregations)
         ]
 
         group_by_aggrs = []
         for aggr in aggrfuncs:
-            if not aggr[0].startswith("hash_"):
-                aggr = ("hash_" + aggr[0], aggr[1])
+            if not aggr[1].startswith("hash_"):
+                aggr = (aggr[0], "hash_" + aggr[1], aggr[2])
             group_by_aggrs.append(aggr)
 
         # Build unique names for aggregation result columns
         # so that it's obvious what they refer to.
-        column_names = [
-            aggr_name.replace("hash", col_name)
-            for col_name, (aggr_name, _) in zip(columns, group_by_aggrs)
+        out_column_names = [
+            aggr_name.replace("hash", "_".join(target))
+            for target, aggr_name, _ in group_by_aggrs

Review Comment:
   We might want to special-case "count_all" here (or nullary kernels in 
general, if we can infer that), because right now this adds a leading 
underscore to the resulting name: `"_count_all"`
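   One possible shape for that special case, sketched (hypothetical helper; this assumes an empty target list is what identifies a nullary kernel):

   ```python
   def output_column_name(target, hash_aggr_name):
       # target: list of input column names; hash_aggr_name: e.g. "hash_sum".
       prefix = "_".join(target)
       if not prefix:
           # Nullary kernel (e.g. "hash_count_all"): drop "hash_" entirely
           # rather than producing a leading underscore.
           return hash_aggr_name[len("hash_"):]
       return hash_aggr_name.replace("hash", prefix, 1)
   ```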



##########
python/pyarrow/table.pxi:
##########
@@ -5334,7 +5334,7 @@ class TableGroupBy:
         Parameters
         ----------
         aggregations : list[tuple(str, str)] or \
-list[tuple(str, str, FunctionOptions)]
+list[tuple(str|list[str]|tuple(str*), str, FunctionOptions)]

Review Comment:
   Basically, IIUC the "only" thing this PR is adding is that the target 
column name (first element of the tuple) can be a list of names instead of a 
single name (string). API-wise I think that is reasonable? (I am not sure 
alternatives would be better, like not passing any column name and inferring 
that from the length of the tuple ...)
   
   (One alternative would be to make it more structured and provide some 
namedtuple or small class to define an "aggregation"; this argument would then 
be a list of such objects. But that won't necessarily make it easier to use, 
and we probably want to keep supporting the plain tuple as well.)
   
   Personally, though, I do find the above line explaining the type completely 
unreadable. I would simplify it to just "list of tuples" and explain what each 
tuple can contain in the description below.



##########
python/pyarrow/_compute.pyx:
##########
@@ -2202,12 +2202,18 @@ def _group_by(args, keys, aggregations):
     _pack_compute_args(args, &c_args)
     _pack_compute_args(keys, &c_keys)
 
-    for aggr_func_name, aggr_opts in aggregations:
+    # reference into the flattened list of arguments for the aggregations
+    field_ref = 0
+    for aggr_arg_names, aggr_func_name, aggr_opts in aggregations:
         c_aggr.function = tobytes(aggr_func_name)
         if aggr_opts is not None:
             c_aggr.options = (<FunctionOptions?>aggr_opts).wrapped
         else:
             c_aggr.options = <shared_ptr[CFunctionOptions]>nullptr
+        for _ in aggr_arg_names:
+            c_aggr.target.push_back(CFieldRef(<int> field_ref))
+            field_ref += 1

Review Comment:
   Not for this PR, but would it make sense for this code to start referring 
to columns by field name instead of by integer index? (But then we would need 
to be able to pass the `args` as an actual table/recordbatch instead of a list 
of arrays?)
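   For illustration only, the name-based alternative amounts to a lookup roughly like this (hypothetical Python helper; the real code would construct `CFieldRef` from the name instead of an index):

   ```python
   def resolve_field_refs(schema_names, target_names):
       # Map aggregation target column names to integer field indices,
       # instead of tracking a running counter over the flattened args.
       return [schema_names.index(name) for name in target_names]
   ```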



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]