[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11830: ARROW-13832: [Doc] Improve compute documentation

GitBox Thu, 02 Dec 2021 07:30:30 -0800


jorisvandenbossche commented on a change in pull request #11830:
URL: https://github.com/apache/arrow/pull/11830#discussion_r761203484




##########
File path: docs/source/python/compute.rst
##########
@@ -23,17 +23,32 @@ Compute Functions
 =================
 
 Arrow supports logical compute operations over inputs of possibly
-varying types.  Many compute functions support both array (chunked or not)
-and scalar inputs, but some will mandate either.  For example,
-``sort_indices`` requires its first and only input to be an array.
+varying types.  
 
-Below are a few simple examples:
+The standard compute operations are provided by the :mod:`pyarrow.compute`
+module and can be used directly::
 
    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
    >>> a = pa.array([1, 1, 2, 3])
    >>> pc.sum(a)
    <pyarrow.Int64Scalar: 7>
+
+The grouped aggregation functions are instead an exception 

Review comment:
       ```suggestion
   The grouped aggregation functions raise an exception instead
   ```

##########
File path: docs/source/python/api/compute.rst
##########
@@ -498,3 +498,50 @@ Structural Transforms
    make_struct
    replace_with_mask
    struct_field
+
+Compute Options
+---------------
+
+.. autosummary::
+   :toctree: ../generated/
+
+   ArraySortOptions
+   AssumeTimezoneOptions

Review comment:
       I am not sure there is much value in listing them here as long as they
have no docstrings (this will create a lot of new doc pages, which will
basically be empty).

##########
File path: docs/source/python/compute.rst
##########
@@ -62,8 +77,89 @@ Here is an example of sorting a table:
       0
     ]
 
-
+For a complete list of the compute functions that PyArrow provides
+you can refer to :ref:`api.compute` reference.
 
 .. seealso::
 
    :ref:`Available compute functions (C++ documentation) <compute-function-list>`.
+
+Grouped Aggregations
+====================
+
+PyArrow supports grouped aggregations over :class:`pyarrow.Table` through the
+:meth:`pyarrow.Table.group_by` method. 
+The method will return a grouping declaration
+to which the hash aggregation functions can be applied::
+
+   >>> import pyarrow as pa
+   >>> t = pa.table([
+   ...       pa.array(["a", "a", "b", "b", "c"]),
+   ...       pa.array([1, 2, 3, 4, 5]),
+   ... ], names=["keys", "values"])
+   >>> t.group_by("keys").aggregate([("values", "sum")])
+   pyarrow.Table
+   values_sum: int64
+   keys: string
+   ----
+   values_sum: [[3,7,5]]
+   keys: [["a","b","c"]]
+
+The ``"sum"`` aggregation passed to the ``aggregate`` method in the previous
+example is the ``hash_sum`` compute function.
+
+Multiple aggregations can be performed at the same time by providing them
+to the ``aggregate`` method::
+
+   >>> import pyarrow as pa
+   >>> t = pa.table([
+   ...       pa.array(["a", "a", "b", "b", "c"]),
+   ...       pa.array([1, 2, 3, 4, 5]),
+   ... ], names=["keys", "values"])
+   >>> t.group_by("keys").aggregate([
+   ...    ("values", "sum"),
+   ...    ("keys", "count")
+   ... ])
+   pyarrow.Table
+   values_sum: int64
+   keys_count: int64
+   keys: string
+   ----
+   values_sum: [[3,7,5]]
+   keys_count: [[2,2,1]]
+   keys: [["a","b","c"]]
+
+Aggregation options can also be provided for each aggregation function,
+for example we can use :class:`CountOptions` to change how we count
+null values::
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> table_with_nulls = pa.table([
+   ...    pa.array(["a", "a", "a"]),
+   ...    pa.array([1, None, None])
+   ... ], names=["keys", "values"])
+   >>> table_with_nulls.group_by(["keys"]).aggregate([
+   ...    ("values", "count", pc.CountOptions(mode="all"))
+   ... ])
+   pyarrow.Table
+   values_count: int64
+   keys: string
+   ----
+   values_count: [[3]]
+   keys: [["a"]]
+   >>> table_with_nulls.group_by(["keys"]).aggregate([
+   ...    ("values", "count", pc.CountOptions(mode="only_valid"))
+   ... ])
+   pyarrow.Table
+   values_count: int64
+   keys: string
+   ----
+   values_count: [[1]]
+   keys: [["a"]]
+
+Following is a list of all supported grouped aggregation functions.
+You can use them with or without the ``"hash_"`` prefix.
+
+.. arrow-computefuncs::
+  :kind: hash_aggregate

Review comment:
       Nice!
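
For readers following the thread without the rendered docs, the results shown in the grouped aggregation examples above can be reproduced with a plain-Python sketch. This is an emulation for illustration only, not how pyarrow implements `hash_sum` or `hash_count`; the helper names `group_sum` and `group_count` are hypothetical, and only the two count modes used in the diff are covered.

```python
from collections import defaultdict


def group_sum(keys, values):
    """Sum values per key (assumes no null values, as in the first example)."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


def group_count(keys, values, mode="only_valid"):
    """Count values per key; None stands in for an Arrow null.

    mode="all" counts every row, mode="only_valid" skips nulls.
    """
    if mode not in ("all", "only_valid"):
        raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        counts[key] += 0  # make sure every key appears in the result
        if mode == "all" or value is not None:
            counts[key] += 1
    return dict(counts)


sums = group_sum(["a", "a", "b", "b", "c"], [1, 2, 3, 4, 5])
count_all = group_count(["a", "a", "a"], [1, None, None], mode="all")
count_valid = group_count(["a", "a", "a"], [1, None, None], mode="only_valid")
```

The per-key results match the `values_sum` and `values_count` columns in the doctest output quoted in the diff.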

##########
File path: docs/source/python/compute.rst
##########
@@ -23,17 +23,32 @@ Compute Functions
 =================
 
 Arrow supports logical compute operations over inputs of possibly
-varying types.  Many compute functions support both array (chunked or not)
-and scalar inputs, but some will mandate either.  For example,
-``sort_indices`` requires its first and only input to be an array.
+varying types.  
 
-Below are a few simple examples:
+The standard compute operations are provided by the :mod:`pyarrow.compute`
+module and can be used directly::
 
    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
    >>> a = pa.array([1, 1, 2, 3])
    >>> pc.sum(a)
    <pyarrow.Int64Scalar: 7>
+
+The grouped aggregation functions are instead an exception 
+and need to be used through the :meth:`pyarrow.Table.group_by` capabilities.

Review comment:
       Maybe refer to the new section below?

##########
File path: docs/source/conf.py
##########
@@ -463,3 +464,49 @@ def setup(app):
     # This will also rebuild appropriately when the value changes.
     app.add_config_value('cuda_enabled', cuda_enabled, 'env')
     app.add_config_value('flight_enabled', flight_enabled, 'env')
+    app.add_directive('arrow-computefuncs', ComputeFunctionsTableDirective)
+
+
+class ComputeFunctionsTableDirective(Directive):
+    """Generate a table of Arrow compute functions.
+
+    .. arrow-computefuncs::
+        :kind: hash_aggregate
+
+    The generated table will include function name,
+    description and option class reference.
+
+    The functions listed in the table can be restricted
+    with the :kind: option.
+    """
+    has_content = True
+    option_spec = {
+        "kind": directives.unchanged
+    }
+
+    def run(self):
+        from docutils.statemachine import ViewList
+        from docutils import nodes
+        import pyarrow.compute as pc
+
+        result = ViewList()
+        function_kind = self.options.get('kind', None)
+
+        result.append(".. csv-table::", "<computefuncs>")

Review comment:
       Out of curiosity, what's the "<computefuncs>" for?
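
A note on the question above, hedged because the PR itself does not state it: in docutils, `ViewList.append(item, source, offset)` takes a source label as its second argument, so `"<computefuncs>"` most likely serves as the source name recorded for the generated lines (for example, when a warning is reported against them). A minimal sketch, assuming docutils is installed; the `:header:` line is made up for illustration and is not taken from the PR:

```python
from docutils.statemachine import ViewList

lines = ViewList()
# The second argument is the source label recorded for the line,
# the optional third argument is its offset within that source.
lines.append(".. csv-table::", "<computefuncs>")
lines.append("   :header: Function, Description", "<computefuncs>", 1)

# info(i) returns the (source, offset) pair stored for line i.
first_source, first_offset = lines.info(0)
second_info = lines.info(1)
```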




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
