[GitHub] [arrow] ianmcook commented on a change in pull request #10887: ARROW-13311: [C++][Documentation] Document hash aggregate kernels

GitBox Wed, 18 Aug 2021 09:28:43 -0700


ianmcook commented on a change in pull request #10887:
URL: https://github.com/apache/arrow/pull/10887#discussion_r690709164




##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,90 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+group by operation. Like scalar aggregations, grouped aggregations reduce
+multiple input values to a single output value. Instead of aggregating all
+values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+and emit one output per input group.
+
+As an example, for the following table:
+
++-----------------+--------------+
+| Column "x"      | Column "key" |
++=================+==============+
+| 2               | "a"          |
++-----------------+--------------+
+| 5               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | null         |
++-----------------+--------------+
+| 5               | null         |
++-----------------+--------------+
+
+We compute a sum of column "x", grouped on the key column "key". This gives us
+three groups:
+
++-----------------+--------------+
+| Column "sum(x)" | Column "key" |
++=================+==============+
+| 7               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| 5               | null         |
++-----------------+--------------+
+
+The supported aggregation functions are as follows.

Review comment:
       Could you add a sentence here explaining the meaning of the `hash_` 
prefix in these function names?

##########
File path: docs/source/cpp/compute.rst
##########
@@ -230,10 +234,64 @@ Notes:
   Note that the output can have less than *N* elements if the input has
   less than *N* distinct values.
 
+  The mode kernel is not a proper aggregate (it is actually a vector
+  function, see below).
+
 * \(5) Output is Int64, UInt64 or Float64, depending on the input type.
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+  The quantile kernel is not a proper aggregate (it is actually a vector
+  function, see below).
+
+* \(6) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Hash Aggregations ("group by")

Review comment:
       👍 I like "grouped aggregations"
   
   I think it's also worth briefly explaining what "hash" means here, so that
users understand why these function names all begin with `hash_`. I can
imagine a confused user thinking this is a list of cryptographic hash
functions.

##########
File path: docs/source/cpp/compute.rst
##########
@@ -230,10 +234,64 @@ Notes:
   Note that the output can have less than *N* elements if the input has
   less than *N* distinct values.
 
+  The mode kernel is not a proper aggregate (it is actually a vector
+  function, see below).
+
 * \(5) Output is Int64, UInt64 or Float64, depending on the input type.
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+  The quantile kernel is not a proper aggregate (it is actually a vector
+  function, see below).
+
+* \(6) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Hash Aggregations ("group by")

Review comment:
       👍 I like "grouped aggregations"
   
   I think it's also worth briefly explaining what "hash" means here, so that
users understand why these function names all begin with `hash_` (as noted
in my other comment below). I can imagine a confused user thinking this is a
list of cryptographic hash functions.
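
   To illustrate the kind of explanation I have in mind: as far as I
understand it, "hash" refers to grouping rows through a hash table keyed on
the key columns, not to hashing the values themselves. Below is a rough sketch
in plain Python (not Arrow code; `grouped_sum` is a hypothetical name, and this
is not how the kernels are actually implemented). It reproduces the example
table from the docs, with null modeled as `None`.

```python
def grouped_sum(keys, values):
    """Sum `values` per distinct key, using a dict as the hash table.

    None (null) is treated as a key like any other. Null values are
    skipped, so a group with only null values keeps a null sum.
    """
    sums = {}  # key -> running sum, or None if no non-null value seen yet
    for key, value in zip(keys, values):
        current = sums.setdefault(key, None)  # register the group on first sight
        if value is not None:
            sums[key] = value if current is None else current + value
    return sums


# The example table from the docs: column "key" and column "x".
keys = ["a", "a", "b", "b", None, None]
xs = [2, 5, None, None, None, 9]

result = grouped_sum(keys, xs)
print(result)
```

   The dict ends up with three entries, one per group, which matches the
three-row result table in the docs: 7 for "a", null for "b", and 9 for the
null key.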

##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,93 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+SQL-style "group by" operation. Like scalar aggregations, grouped aggregations
+reduce multiple input values to a single output value. Instead of aggregating
+all values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+emitting one output value per input group.
+
+As an example, for the following table:
+
++-----------------+--------------+
+| Column "x"      | Column "key" |
++=================+==============+
+| 2               | "a"          |
++-----------------+--------------+
+| 5               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | null         |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+

Review comment:
       I think this is clearer if `key` comes before `x`
   ```suggestion
   +-----------------+--------------+
   | Column ``key``  | Column ``x`` |
   +=================+==============+
   | "a"             | 2            |
   +-----------------+--------------+
   | "a"             | 5            |
   +-----------------+--------------+
   | "b"             | null         |
   +-----------------+--------------+
   | "b"             | null         |
   +-----------------+--------------+
   | null            | null         |
   +-----------------+--------------+
   | null            | 9            |
   +-----------------+--------------+
   ```

##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,93 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+SQL-style "group by" operation. Like scalar aggregations, grouped aggregations
+reduce multiple input values to a single output value. Instead of aggregating
+all values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+emitting one output value per input group.
+
+As an example, for the following table:
+
++-----------------+--------------+
+| Column "x"      | Column "key" |
++=================+==============+
+| 2               | "a"          |
++-----------------+--------------+
+| 5               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | null         |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+
+
+we can compute a sum of the column "x", grouped on the column "key".
+This gives us three groups, with the following results. Note that null is
+treated as a distinct key value.
+
++-----------------+--------------+
+| Column "sum(x)" | Column "key" |
++=================+==============+
+| 7               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+

Review comment:
       ```suggestion
   +-----------------+-------------------+
   | Column ``key``  | Column ``sum(x)`` |
   +=================+===================+
   | "a"             | 7                 |
   +-----------------+-------------------+
   | "b"             | null              |
   +-----------------+-------------------+
   | null            | 9                 |
   +-----------------+-------------------+
   ```

##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,93 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+SQL-style "group by" operation. Like scalar aggregations, grouped aggregations
+reduce multiple input values to a single output value. Instead of aggregating
+all values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+emitting one output value per input group.
+
+As an example, for the following table:
+
++-----------------+--------------+
+| Column "x"      | Column "key" |
++=================+==============+
+| 2               | "a"          |
++-----------------+--------------+
+| 5               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | null         |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+
+
+we can compute a sum of the column "x", grouped on the column "key".

Review comment:
       ```suggestion
   we can compute a sum of the column ``x``, grouped on the column ``key``.
   ```

##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,93 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+SQL-style "group by" operation. Like scalar aggregations, grouped aggregations
+reduce multiple input values to a single output value. Instead of aggregating
+all values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+emitting one output value per input group.
+
+As an example, for the following table:
+
++-----------------+--------------+
+| Column "x"      | Column "key" |
++=================+==============+
+| 2               | "a"          |
++-----------------+--------------+
+| 5               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| null            | null         |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+
+
+we can compute a sum of the column "x", grouped on the column "key".
+This gives us three groups, with the following results. Note that null is
+treated as a distinct key value.
+
++-----------------+--------------+
+| Column "sum(x)" | Column "key" |
++=================+==============+
+| 7               | "a"          |
++-----------------+--------------+
+| null            | "b"          |
++-----------------+--------------+
+| 9               | null         |
++-----------------+--------------+
+
+The supported aggregation functions are as follows. All function names are
+prefixed with "hash\_", which differentiates them from their scalar

Review comment:
       ```suggestion
   prefixed with ``hash\_``, which differentiates them from their scalar
   ```

##########
File path: docs/source/cpp/compute.rst
##########
@@ -234,6 +238,93 @@ Notes:
 
 * \(6) Output is Float64 or input type, depending on QuantileOptions.
 
+* \(7) tdigest/t-digest computes approximate quantiles, and so only needs a
+  fixed amount of memory. See the `reference implementation
+  <https://github.com/tdunning/t-digest>`_ for details.
+
+Grouped Aggregations ("group by")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grouped aggregations are not directly invokable, but are used as part of a
+SQL-style "group by" operation. Like scalar aggregations, grouped aggregations
+reduce multiple input values to a single output value. Instead of aggregating
+all values of the input, however, grouped aggregations partition the input
+values on some set of "key" columns, then aggregate each group individually,
+emitting one output value per input group.
+
+As an example, for the following table:
+
++------------------+-----------------+
+| Column ``key``   | Column ``x``    |
++==================+=================+
+| "a"              | 2               |
++------------------+-----------------+
+| "a"              | 5               |
++------------------+-----------------+
+| "b"              | null            |
++------------------+-----------------+
+| "b"              | null            |
++------------------+-----------------+
+| null             | null            |
++------------------+-----------------+
+| null             | 9               |
++------------------+-----------------+
+
+we can compute a sum of the column ``x``, grouped on the column ``key``.
+This gives us three groups, with the following results. Note that null is
+treated as a distinct key value.
+
++------------------+-------------------+
+| Column ``key``   | Column ``sum(x)`` |
++==================+===================+
+| "a"              | 7                 |
++------------------+-------------------+
+| "b"              | null              |
++------------------+-------------------+
+| null             | 9                 |
++------------------+-------------------+

Review comment:
       When I render this, the second column needs to be even wider to keep
the header text from wrapping.
   ```suggestion
   +------------------+-----------------------+
   | Column ``key``   | Column ``sum(x)``     |
   +==================+=======================+
   | "a"              | 7                     |
   +------------------+-----------------------+
   | "b"              | null                  |
   +------------------+-----------------------+
   | null             | 9                     |
   +------------------+-----------------------+
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
