bkietz commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r633591259



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.

Review comment:
       The structures with static `Call` functions (such as `struct Add`) which 
are used to aid in efficient construction of some kernels are not identical to 
the kernels:
   
   ```suggestion
   Kernels are simple ``structs`` containing only function pointers (the 
"methods" of the kernel) and attribute flags. Each function kind corresponds to 
a class of Kernel with methods representing each stage of the function's 
execution. For example, :struct:`ScalarKernel` includes (optionally) 
:member:`ScalarKernel::init` to initialize any state necessary for execution 
and :member:`ScalarKernel::exec` to perform the computation.
   
   Since many kernels are closely related in operation and differ only in their 
input types, it's frequently useful to leverage c++'s powerful template system 
to efficiently generate kernels' methods. For example, the "add" compute 
function accepts all numeric types and its kernels' methods are instantiations 
of the same function template.
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal 
code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" 
for arithmetic kernels. Helper macros are included to create the definitions 
which invoke the corresponding operation in 
[`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h)
 library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute 
functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a 
single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers 
over 
["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext)
 (one-shot function). Arrow provides macros to easily define compute functions 
based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and 
only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for 
overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for 
"Scalar Arithmetic" compute functions. Kernel definitions are defined via a 
class with literal name of compute function and containing methods named `Call` 
that are parameterized for specific input types (signed/unsigned integer and 
floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is 
a class suffixed with `Checked` and makes use of assertions and overflow 
checks. If overflow occurs, kernel returns zero and sets that `Status*` error 
flag.
+        * For compute functions that do not have a valid mathematical 
operation for specific datatypes (e.g., negate an unsigned integer), the kernel 
for those types is provided but should trigger an error with `DCHECK(false) << 
This is included only for the purposes of instantiability from the "arithmetic 
kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for 
invoking compute functions. A compute function should invoke `CallFunction` in 
its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default 
there is a single global registry where all kernels reside. `ExecContext` 
maintains a reference to the registry, if reference is NULL then the default 
registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options 
to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored 
library `portable_snippets` to perform "safe math" operations (e.g., 
arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h
 for each primitive datatype supported. Use the helper macros of the form 
`*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also 
contains helper functions for performing safe integral arithmetic for the 
kernels' default variant.
+
+The short-hand name maps to the predefined operation names in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033.
 For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name 
`add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and 
associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute 
function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute 
functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) 
to prevent a name collision. Note that the long form of names is preferred, and 
multi-word names are allowed due to the fact that string versions use an 
underscore instead of whitespace and C++ function names use camel case 
convention.
+     * What is a representative and unambiguous name for the operation 
performed by the compute function?
+     * If a related or variant form of a compute function is to be added in 
the future, is the current name extensible or specific enough to allow room for 
clear differentiation? For example, `str_length` is not a good name because 
there are different types of strings, so in this case it is preferable to be 
specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provides full support for the new 
compute function?
+        * If not, is it straightforward to add the missing parts or can the 
new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source 
files based on its "kind". For example the API of a "Scalar" function is found 
in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h 
and its definition in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on 
the compute function's "kind" and category. For example, a "Scalar" arithmetic 
function has kernels defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, 
and argument types of its operation. A `FunctionDoc` object is instantiated and 
used in the registration step. Note that for compute functions that can 
overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators 
support the `arity` and data types allowed for the new compute function. Kernel 
generators are not of the same form for all the kernel `kinds`. For example, in 
the "Scalar Arithmetic" kernels, registration functions have names of the form 
`MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, 
you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its 
`FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses Google test framework. All kernels should have tests to ensure 
stability of the compute layer. Tests should at least cover ordinary inputs, 
corner cases, extreme values, nulls, different data types, and invalid tests. 
Moreover, there can be kernel-specific tests. For example, for arithmetic 
kernels, tests should include `NaN` and `Inf` inputs. The test files are 
located alongside the kernel source files and suffixed with `_test`. Tests are 
grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for 
the given compute function. The `test suite name` is associated with a set of 
data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from 
multiple compute functions can be placed in the same test suite. For example, 
`TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and 
`TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       I think a simple (un-templated, non-suite) C++ test case is a necessary 
code snippet. For example,
   
   ```c++
   // scalar_arithmetic_test.cc
   TEST(AbsoluteValue, IntegralInputs) {
     for (auto type : {int8(), int16(), int32(), int64()}) {
       CheckScalarUnary("absolute_value", int8(), "[]", int8(), "[]");
   
       CheckScalarUnary("absolute_value", int8(), "[0, -1, 1, 2, -2, 3, -3]", 
int8(),
                        "[0, 1, 1, 2, 2, 3, 3]");
     }
   }
   ```
   
   Note that the above compiles and executes *before* adding the absolute_value 
function (IMHO it's a useful clarification of intended behavior to add a 
failing test like this as early as possible in the addition of a new function).

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+

Review comment:
       Please describe this briefly, mentioning that it's used for interactive 
help in the bindings

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.
+
+Function options
+----------------
+

Review comment:
       Add an explanation and an example here please

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate

Review comment:
       ```suggestion
   Hash aggregate
   ~~~~~~~~~~~~~~
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal 
code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" 
for arithmetic kernels. Helper macros are included to create the definitions 
which invoke the corresponding operation in 
[`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h)
 library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute 
functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a 
single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers 
over 
["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext)
 (one-shot function). Arrow provides macros to easily define compute functions 
based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and 
only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for 
overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for 
"Scalar Arithmetic" compute functions. Kernel definitions are defined via a 
class with literal name of compute function and containing methods named `Call` 
that are parameterized for specific input types (signed/unsigned integer and 
floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is 
a class suffixed with `Checked` and makes use of assertions and overflow 
checks. If overflow occurs, kernel returns zero and sets that `Status*` error 
flag.
+        * For compute functions that do not have a valid mathematical 
operation for specific datatypes (e.g., negate an unsigned integer), the kernel 
for those types is provided but should trigger an error with `DCHECK(false) << 
This is included only for the purposes of instantiability from the "arithmetic 
kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for 
invoking compute functions. A compute function should invoke `CallFunction` in 
its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default 
there is a single global registry where all kernels reside. `ExecContext` 
maintains a reference to the registry, if reference is NULL then the default 
registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options 
to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored 
library `portable_snippets` to perform "safe math" operations (e.g., 
arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h
 for each primitive datatype supported. Use the helper macros of the form 
`*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also 
contains helper functions for performing safe integral arithmetic for the 
kernels' default variant.
+
+The short-hand name maps to the predefined operation names in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033.
 For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name 
`add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and 
associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute 
function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute 
functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) 
to prevent a name collision. Note that the long form of names is preferred, and 
multi-word names are allowed due to the fact that string versions use an 
underscore instead of whitespace and C++ function names use camel case 
convention.
+     * What is a representative and unambiguous name for the operation 
performed by the compute function?
+     * If a related or variant form of a compute function is to be added in 
the future, is the current name extensible or specific enough to allow room for 
clear differentiation? For example, `str_length` is not a good name because 
there are different types of strings, so in this case it is preferable to be 
specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provides full support for the new 
compute function?
+        * If not, is it straightforward to add the missing parts or can the 
new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source 
files based on its "kind". For example the API of a "Scalar" function is found 
in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h 
and its definition in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on 
the compute function's "kind" and category. For example, a "Scalar" arithmetic 
function has kernels defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, 
and argument types of its operation. A `FunctionDoc` object is instantiated and 
used in the registration step. Note that for compute functions that can 
overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators 
support the `arity` and data types allowed for the new compute function. Kernel 
generators are not of the same form for all the kernel `kinds`. For example, in 
the "Scalar Arithmetic" kernels, registration functions have names of the form 
`MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, 
you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its 
`FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses Google test framework. All kernels should have tests to ensure 
stability of the compute layer. Tests should at least cover ordinary inputs, 
corner cases, extreme values, nulls, different data types, and invalid tests. 
Moreover, there can be kernel-specific tests. For example, for arithmetic 
kernels, tests should include `NaN` and `Inf` inputs. The test files are 
located alongside the kernel source files and suffixed with `_test`. Tests are 
grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for 
the given compute function. The `test suite name` is associated with a set of 
data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from 
multiple compute functions can be placed in the same test suite. For example, 
`TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and 
`TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 
10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and 
floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------

Review comment:
       Try to avoid ambiguity between the instance of `compute::Function` which 
is being added to the registry and the convenience function which is being 
added to `api_scalar.h`. The former is the canonical definition. The latter is 
a wrapper for easy use from C++, and probably isn't necessary for most compute 
functions. In fact for simplicity it might be better to avoid modifying 
`api_scalar.h` in this tutorial

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in 
https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute 
submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) 
contains analytical functions that process primarily columnar data for either 
scalar or Arrow-based array inputs. These are intended for use inside query 
engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or 
scalar operations on whole arrays at a time. Other functions are not SQL-like 
and compute results that may be a different length or whose results depend on 
the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that 
may have many different implementations corresponding to different combinations 
of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable 
kernel for executing a function is referred to as "dispatching". When executing 
a function on inputs, we must first select a suitable kernel corresponding to 
the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a 
"function registry". Given a function name and argument types, we can look up 
that function and dispatch to a compatible kernel.
+
+[Compute 
functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)
 have the following principal attributes:
+* A unique 
["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv)
 used for function invocation and language bindings
+* A 
["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output 
[types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and 
[shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of 
operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and 
subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h))
 contain 
["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels)
 which are implementations for specific argument signatures.
+* An 
["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE)
 which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or 
ternary, but can also be variadic.
+* 
["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
 describing the function's functionality and behavior
+* 
["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
 specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" 
and category, see 
https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, 
refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own 
kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's 
API. These `structs` contain static *Call* methods representing the unique 
implementation for each argument signature. Apache Arrow conforms to SFINAE and 
aliased-template conditionals to generalize kernel implementations for 
different argument types. Also, kernel implementations can have the *constexpr* 
specifier if applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal 
code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" 
for arithmetic kernels. Helper macros are included to create the definitions 
which invoke the corresponding operation in 
[`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h)
 library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute 
functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a 
single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers 
over 
["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext)
 (one-shot function). Arrow provides macros to easily define compute functions 
based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and 
only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for 
overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for 
"Scalar Arithmetic" compute functions. Kernel definitions are defined via a 
class with literal name of compute function and containing methods named `Call` 
that are parameterized for specific input types (signed/unsigned integer and 
floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is 
a class suffixed with `Checked` and makes use of assertions and overflow 
checks. If overflow occurs, kernel returns zero and sets that `Status*` error 
flag.
+        * For compute functions that do not have a valid mathematical 
operation for specific datatypes (e.g., negate an unsigned integer), the kernel 
for those types is provided but should trigger an error with `DCHECK(false) << 
This is included only for the purposes of instantiability from the "arithmetic 
kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for 
invoking compute functions. A compute function should invoke `CallFunction` in 
its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default 
there is a single global registry where all kernels reside. `ExecContext` 
maintains a reference to the registry, if reference is NULL then the default 
registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options 
to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored 
library `portable_snippets` to perform "safe math" operations (e.g., 
arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h
 for each primitive datatype supported. Use the helper macros of the form 
`*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also 
contains helper functions for performing safe integral arithmetic for the 
kernels' default variant.
+
+The short-hand name maps to the predefined operation names in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033.
 For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name 
`add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and 
associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute 
function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute 
functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) 
to prevent a name collision. Note that the long form of names is preferred, and 
multi-word names are allowed due to the fact that string versions use an 
underscore instead of whitespace and C++ function names use camel case 
convention.
+     * What is a representative and unambiguous name for the operation 
performed by the compute function?
+     * If a related or variant form of a compute function is to be added in 
the future, is the current name extensible or specific enough to allow room for 
clear differentiation? For example, `str_length` is not a good name because 
there are different types of strings, so in this case it is preferable to be 
specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provides full support for the new 
compute function?
+        * If not, is it straightforward to add the missing parts or can the 
new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source 
files based on its "kind". For example the API of a "Scalar" function is found 
in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h 
and its definition in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on 
the compute function's "kind" and category. For example, a "Scalar" arithmetic 
function has kernels defined in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, 
and argument types of its operation. A `FunctionDoc` object is instantiated and 
used in the registration step. Note that for compute functions that can 
overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators 
support the `arity` and data types allowed for the new compute function. Kernel 
generators are not of the same form for all the kernel `kinds`. For example, in 
the "Scalar Arithmetic" kernels, registration functions have names of the form 
`MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, 
you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its 
`FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses Google test framework. All kernels should have tests to ensure 
stability of the compute layer. Tests should at least cover ordinary inputs, 
corner cases, extreme values, nulls, different data types, and invalid tests. 
Moreover, there can be kernel-specific tests. For example, for arithmetic 
kernels, tests should include `NaN` and `Inf` inputs. The test files are 
located alongside the kernel source files and suffixed with `_test`. Tests are 
grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for 
the given compute function. The `test suite name` is associated with a set of 
data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from 
multiple compute functions can be placed in the same test suite. For example, 
`TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and 
`TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 
10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and 
floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = 
ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and 
overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which 
requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", 
"absolute_value_checked")

Review comment:
       In general, please avoid use of macros like `SCALAR_ARITHMETIC_UNARY()` 
and other context specific helpers for this walk through, as they obscure the 
underlying C++ and require a user to look up the macro's definition to 
understand what they're writing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to