[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11624: ARROW-14608: [Python] Provide access to hash_aggregate functions through a Table.group_by method

GitBox Mon, 22 Nov 2021 07:59:02 -0800


jorisvandenbossche commented on a change in pull request #11624:
URL: https://github.com/apache/arrow/pull/11624#discussion_r754406551




##########
File path: python/pyarrow/table.pxi
##########
@@ -2192,6 +2192,48 @@ cdef class Table(_PandasConvertible):
 
         return table
 
+    def group_by(self, keys):
+        """Declare a grouping over the columns of the table.
+
+        resulting grouping can then be used to perform aggregations.
+
+        Parameters
+        ----------
+        keys : str or list[str]
+            Name of the columns that should be used as the grouping key.
+
+        Returns
+        -------
+        TableGroupBy
+        """

Review comment:
       ```suggestion
   
           See Also
           --------
           TableGroupBy.aggregate
           """
   ```

##########
File path: python/pyarrow/table.pxi
##########
@@ -2192,6 +2192,48 @@ cdef class Table(_PandasConvertible):
 
         return table
 
+    def group_by(self, keys):
+        """Declare a grouping over the columns of the table.
+
+        resulting grouping can then be used to perform aggregations.

Review comment:
       ```suggestion
           The resulting grouping can then be used to perform aggregations
           with a subsequent ``aggregate()`` call.
   ```

##########
File path: python/pyarrow/table.pxi
##########
@@ -2192,6 +2192,48 @@ cdef class Table(_PandasConvertible):
 
         return table
 
+    def group_by(self, keys):
+        """Declare a grouping over the columns of the table.
+
+        resulting grouping can then be used to perform aggregations.
+
+        Parameters
+        ----------
+        keys : str or list[str]
+            Name of the columns that should be used as the grouping key.
+
+        Returns
+        -------
+        TableGroupBy
+        """
+        return TableGroupBy(self, keys)
+
+    def sort_by(self, sorting):
+        """
+        Sort the table by one or multiple columns.
+
+        Parameters
+        ----------
+        sorting : str or list[tuple(name, order)]
+            Name of the column to use to sort, or

Review comment:
       ```suggestion
               Name of the column to use to sort (ascending), or
   ```

##########
File path: python/pyarrow/tests/test_table.py
##########
@@ -1746,3 +1746,101 @@ def test_table_select():
     result = table.select(['f2'])
     expected = pa.table([a2], ['f2'])
     assert result.equals(expected)
+
+
+def test_table_group_by():
+    def sorted_by_keys(d):
+        # Ensure a guaranteed order of keys for aggregation results.
+        if "keys2" in d:
+            keys = tuple(zip(d["keys"], d["keys2"]))
+        else:
+            keys = d["keys"]
+        sorted_keys = sorted(keys)
+        sorted_d = {"keys": sorted(d["keys"])}
+        for entry in d:
+            if entry == "keys":
+                continue
+            values = dict(zip(keys, d[entry]))
+            for k in sorted_keys:
+                sorted_d.setdefault(entry, []).append(values[k])
+        return sorted_d
+
+    table = pa.table([
+        pa.array(["a", "a", "b", "b", "c"]),
+        pa.array(["X", "X", "Y", "Z", "Z"]),
+        pa.array([1, 2, 3, 4, 5]),
+        pa.array([10, 20, 30, 40, 50])
+    ], names=["keys", "keys2", "values", "bigvalues"])
+
+    r = table.group_by("keys").aggregate([
+        ("values", "hash_sum")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "c"],
+        "values_sum": [3, 7, 5]
+    }
+
+    r = table.group_by("keys").aggregate([
+        ("values", "hash_sum"),
+        ("values", "hash_count")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "c"],
+        "values_sum": [3, 7, 5],
+        "values_count": [2, 2, 1]
+    }
+
+    # Test without hash_ prefix
+    r = table.group_by("keys").aggregate([
+        ("values", "sum")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "c"],
+        "values_sum": [3, 7, 5]
+    }
+
+    r = table.group_by("keys").aggregate([
+        ("values", "max"),
+        ("bigvalues", "sum")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "c"],
+        "values_max": [2, 4, 5],
+        "bigvalues_sum": [30, 70, 50]
+    }
+
+    r = table.group_by("keys").aggregate([
+        ("bigvalues", "max"),
+        ("values", "sum")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "c"],
+        "values_sum": [3, 7, 5],
+        "bigvalues_max": [20, 40, 50]
+    }
+
+    r = table.group_by(["keys", "keys2"]).aggregate([
+        ("values", "sum")
+    ])
+    assert sorted_by_keys(r.to_pydict()) == {
+        "keys": ["a", "b", "b", "c"],
+        "keys2": ["X", "Y", "Z", "Z"],
+        "values_sum": [3, 3, 4, 5]
+    }

Review comment:
       Do you have a test where you pass function options?



