This is an automated email from the ASF dual-hosted git repository.
zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 5240670813 GH-46209: [Documentation][C++][Compute] Add cpp developer documentation for row table (#46210)
5240670813 is described below
commit 5240670813a2dac6386eb854f060384a3db946d1
Author: Rossi Sun <[email protected]>
AuthorDate: Mon May 12 10:27:58 2025 -0700
GH-46209: [Documentation][C++][Compute] Add cpp developer documentation for row table (#46210)
### What changes are included in this PR?
Add C++ developer documentation for the row table, placed under the compute
category.
### Are these changes tested?
No need.
### Are there any user-facing changes?
None.
* GitHub Issue: #46209
Lead-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
---
docs/source/cpp/index.rst | 20 +++-
docs/source/developers/cpp/compute.rst | 182 +++++++++++++++++++++++++++++++++
docs/source/developers/cpp/index.rst | 1 +
3 files changed, 201 insertions(+), 2 deletions(-)
diff --git a/docs/source/cpp/index.rst b/docs/source/cpp/index.rst
index ee0434ac0f..c844ed2faa 100644
--- a/docs/source/cpp/index.rst
+++ b/docs/source/cpp/index.rst
@@ -96,11 +96,26 @@ Welcome to the Apache Arrow C++ implementation documentation!
To the API Reference
-.. grid:: 1
+.. grid:: 1 2 2 2
:gutter: 4
:padding: 2 2 0 0
:class-container: sd-text-center
+ .. grid-item-card:: C++ Development
+ :class-card: contrib-card
+ :shadow: none
+
+ Find guidelines and documentation for Arrow C++ developers
+
+ +++
+
+ .. button-link:: ../developers/cpp/index.html
+ :click-parent:
+ :color: primary
+ :expand:
+
+ To C++ Development
+
.. grid-item-card:: Cookbook
:class-card: contrib-card
:shadow: none
@@ -126,4 +141,5 @@ Welcome to the Apache Arrow C++ implementation documentation!
user_guide
Examples <examples/index>
api
- C++ cookbook <https://arrow.apache.org/cookbook/cpp/>
+ C++ Development <../developers/cpp/index>
+ C++ Cookbook <https://arrow.apache.org/cookbook/cpp/>
diff --git a/docs/source/developers/cpp/compute.rst b/docs/source/developers/cpp/compute.rst
new file mode 100644
index 0000000000..21391ff5fb
--- /dev/null
+++ b/docs/source/developers/cpp/compute.rst
@@ -0,0 +1,182 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. highlight:: console
+.. _development-cpp-compute:
+
+============================
+Developing Arrow C++ Compute
+============================
+
+This section provides information for developers of the Arrow C++ Compute module.
+
+Row Table
+=========
+
+The row table in Arrow represents data stored in row-major format. This format
+is particularly useful for scenarios involving random access to individual rows
+and where all columns are frequently accessed together. It is especially
+advantageous for hash-table keys and facilitates efficient operations such as
+grouping and hash joins by optimizing memory access patterns and data locality.
+
+Metadata
+--------
+
+A row table is defined by its metadata, ``RowTableMetadata``, which includes
+information about its schema, alignment, and derived properties.
+
+The schema specifies the types and order of columns. Each row in the row table
+contains the data for each column in that logical order (the physical order may
+vary; see :ref:`row-encoding` for details).
+
+.. note::
+ Columns of nested types or large binary types are **not** supported in the
+ row table.
+
+One important property derived from the schema is whether the row table is
+fixed-length or varying-length. A fixed-length row table contains only
+fixed-length columns, while a varying-length row table includes at least one
+varying-length column. This distinction determines how data is stored and
+accessed in the row table.
+
+Each row in the row table is aligned to ``RowTableMetadata::row_alignment``
+bytes. Fixed-length columns with non-power-of-2 lengths are also aligned to
+``RowTableMetadata::row_alignment`` bytes. Varying-length columns are aligned to
+``RowTableMetadata::string_alignment`` bytes.
+
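The alignment rule above amounts to rounding a length or position up to the next multiple of the alignment. A minimal sketch follows; ``PaddedLength`` is a hypothetical helper written for illustration, not a function of the Arrow API, and the alignment values in the usage note are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Arrow API): rounds `length` up to the
// next multiple of `alignment`, which is how padding for an aligned row or
// column can be computed.
inline int64_t PaddedLength(int64_t length, int64_t alignment) {
  assert(length >= 0);
  assert(alignment > 0);
  return (length + alignment - 1) / alignment * alignment;
}
```

For instance, assuming an alignment of 8 bytes, a 5-byte row would be padded to 8 bytes, while a 16-byte row would need no padding.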
+Buffer Layout
+-------------
+
+Similar to most Arrow ``Array``\s, the row table consists of three buffers:
+
+- **Null Masks Buffer**: Indicates null values for each column in each row.
+- **Fixed-length Buffer**: Stores row data for fixed-length tables or offsets to
+ varying-length data for varying-length tables.
+- **Varying-length Buffer** (Optional): Contains row data for varying-length
+ tables; unused for fixed-length tables.
+
+Row Format
+----------
+
+Null Masks
+~~~~~~~~~~
+
+For each row, a contiguous sequence of bits represents whether each column in
+that row is null. Each bit corresponds to a specific column, with ``1``
+indicating the value is null and ``0`` indicating the value is valid. Note that
+this is the opposite of how the validity bitmap works for ``Array``\s. The null
+mask for a row occupies ``RowTableMetadata::null_masks_bytes_per_row`` bytes.
+
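Reading a null mask can be sketched as below. ``IsNullInRow`` is a hypothetical helper, not an Arrow API. The source does not state the bit order within a byte, so the sketch assumes least-significant-bit-first numbering (column ``c`` maps to bit ``c % 8`` of byte ``c / 8``).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Arrow API).
// Assumption: bits are numbered least-significant-bit first within each byte.
// Returns true when the column is null, since a set bit means "null" here
// (the opposite of an Array validity bitmap).
inline bool IsNullInRow(const uint8_t* null_masks,
                        int64_t null_masks_bytes_per_row,
                        int64_t row, int column) {
  assert(column >= 0 && column / 8 < null_masks_bytes_per_row);
  // Each row owns a contiguous run of null_masks_bytes_per_row bytes.
  const uint8_t* row_mask = null_masks + row * null_masks_bytes_per_row;
  return ((row_mask[column / 8] >> (column % 8)) & 1) != 0;
}
```

Note that the result has the opposite sense of a validity-bitmap lookup: ``true`` means the value is missing.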
+Fixed-length Row Data
+~~~~~~~~~~~~~~~~~~~~~
+
+In a fixed-length row table, row data is stored directly in the fixed-length
+buffer, with all columns of each row stored sequentially. A ``boolean`` column
+is a special case: a normal Arrow ``Array`` stores each value in 1 bit, whereas
+a row table stores it in 1 byte. The varying-length buffer is not used in this
+case.
+
+For example, a row table with the schema ``(int32, boolean)`` and rows
+``[[7, false], [8, true], [9, false], ...]`` is stored in the fixed-length
+buffer as follows:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Row 0
+ - Row 1
+ - Row 2
+ - ...
+ * - ``7 0 0 0, 0 (padding)``
+ - ``8 0 0 0, 1 (padding)``
+ - ``9 0 0 0, 0 (padding)``
+ - ...
+
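The ``(int32, boolean)`` layout above can be reproduced with a short sketch. ``Int32BoolRow`` and ``EncodeFixedRows`` are hypothetical names, not Arrow APIs. The row width is a caller-supplied assumption, because the example shows padding without stating its size. The little-endian byte order follows the ``7 0 0 0`` rendering in the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row type for the (int32, boolean) schema used in the example.
struct Int32BoolRow {
  int32_t value;
  bool flag;
};

// Hypothetical encoder (not part of the Arrow API). Writes each row as
// 4 little-endian bytes for the int32, 1 byte for the boolean, then zero
// padding up to `row_width` bytes.
inline std::vector<uint8_t> EncodeFixedRows(const std::vector<Int32BoolRow>& rows,
                                            int64_t row_width) {
  assert(row_width >= 5);  // 4 bytes of int32 + 1 byte of boolean
  const std::size_t width = static_cast<std::size_t>(row_width);
  std::vector<uint8_t> buffer(rows.size() * width, 0);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    uint8_t* dst = buffer.data() + i * width;
    const uint32_t v = static_cast<uint32_t>(rows[i].value);
    for (int b = 0; b < 4; ++b) {
      dst[b] = static_cast<uint8_t>(v >> (8 * b));  // little-endian byte b
    }
    dst[4] = rows[i].flag ? 1 : 0;  // boolean takes a whole byte
  }
  return buffer;
}
```

Because every row has the same width, row ``i`` starts at byte ``i * row_width``, which is what makes random access into a fixed-length row table cheap.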
+Offsets for Varying-length Row Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In a varying-length row table, the fixed-length buffer contains offsets to the
+varying-length row data, which is stored separately in the optional
+varying-length buffer. The offsets are of type ``RowTableMetadata::offset_type``
+(fixed as ``int64_t``) and indicate the starting position of the row data for
+each row.
+
+Varying-length Row Data
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In a varying-length row table, the varying-length buffer contains the actual row
+data, stored contiguously. The offsets in the fixed-length buffer point to the
+starting position of each row's data.
+
+.. _row-encoding:
+
+Row Encoding
+^^^^^^^^^^^^
+
+A varying-length row is encoded as follows:
+
+- Fixed-length columns are stored first.
+- A sequence of offsets, one per varying-length column, follows. Each offset is
+  a 32-bit integer giving the **end** position, within the row data, of the
+  corresponding varying-length column.
+- Varying-length columns are stored last.
+
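Reading one varying-length column back out of a row encoded this way can be sketched as follows. ``LoadU32LE`` and ``ReadVaryingColumn`` are hypothetical helpers, not Arrow APIs. The sketch assumes little-endian offsets. It also assumes that each varying-length column starts at the previous end position rounded up to the string alignment, with the first column starting right after the offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: loads a 32-bit little-endian unsigned integer.
inline uint32_t LoadU32LE(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical reader (not part of the Arrow API). `row` points at the start
// of one encoded row: fixed-length columns, then one 32-bit end offset per
// varying-length column, then the varying-length data.
inline std::string ReadVaryingColumn(const uint8_t* row, int64_t fixed_length_bytes,
                                     int num_varying_columns, int column,
                                     int64_t string_alignment) {
  assert(column >= 0 && column < num_varying_columns);
  assert(string_alignment > 0);
  const uint8_t* ends = row + fixed_length_bytes;
  // Assumption: data for column 0 begins right after the offsets; later
  // columns begin after the previous column's end. Both are then aligned.
  const int64_t prev_end =
      column == 0 ? fixed_length_bytes + 4 * static_cast<int64_t>(num_varying_columns)
                  : static_cast<int64_t>(LoadU32LE(ends + 4 * (column - 1)));
  const int64_t begin =
      (prev_end + string_alignment - 1) / string_alignment * string_alignment;
  const int64_t end = LoadU32LE(ends + 4 * column);
  assert(end >= begin);
  return std::string(reinterpret_cast<const char*>(row + begin),
                     static_cast<std::size_t>(end - begin));
}
```

Storing end positions rather than lengths means a column's extent is recovered from two adjacent offsets, much like the offsets buffer of a variable-size binary ``Array``.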
+For example, a row table with the schema ``(int32, string, string, int32)`` and
+rows ``[[7, 'Alice', 'x', 0], [8, 'Bob', 'y', 1], [9, 'Charlotte', 'z', 2], ...]``
+is stored as follows (assuming 8-byte alignment for varying-length columns):
+
+Fixed-length buffer (row offsets):
+
+.. list-table::
+ :header-rows: 1
+
+ * - Row 0
+ - Row 1
+ - Row 2
+ - Row 3
+ - ...
+ * - ``0 0 0 0 0 0 0 0``
+ - ``32 0 0 0 0 0 0 0``
+ - ``64 0 0 0 0 0 0 0``
+ - ``104 0 0 0 0 0 0 0``
+ - ...
+
+Varying-length buffer (row data):
+
+.. list-table::
+ :header-rows: 1
+
+ * - Row
+ - Fixed-length Cols
+ - Varying-length Offsets
+ - Varying-length Cols
+ * - 0
+ - ``7 0 0 0, 0 0 0 0``
+ - ``21 0 0 0, 25 0 0 0``
+ - ``Alice~~~x~~~~~~~``
+ * - 1
+ - ``8 0 0 0, 1 0 0 0``
+ - ``19 0 0 0, 25 0 0 0``
+ - ``Bob~~~~~y~~~~~~~``
+ * - 2
+ - ``9 0 0 0, 2 0 0 0``
+ - ``25 0 0 0, 33 0 0 0``
+ - ``Charlotte~~~~~~~z~~~~~~~``
+ * - 3
+ - ...
+ - ...
+ - ...
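
The numbers in the two tables above can be derived mechanically. The sketch below computes the per-column end offsets and the padded row length for one varying-length row. ``AlignUp``, ``VaryingRowLayout``, and ``LayoutVaryingRow`` are hypothetical names, not Arrow APIs. The placement rule (align, then append each column, then pad the row) is inferred from the example rather than taken from the implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: rounds `n` up to the next multiple of `alignment`.
inline int64_t AlignUp(int64_t n, int64_t alignment) {
  assert(alignment > 0);
  return (n + alignment - 1) / alignment * alignment;
}

// Hypothetical result type for the layout of one varying-length row.
struct VaryingRowLayout {
  std::vector<uint32_t> end_offsets;  // one end offset per varying-length column
  int64_t row_length = 0;             // total bytes, including trailing padding
};

// Hypothetical layout function (not part of the Arrow API).
inline VaryingRowLayout LayoutVaryingRow(int64_t fixed_length_bytes,
                                         const std::vector<int64_t>& varying_lengths,
                                         int64_t string_alignment,
                                         int64_t row_alignment) {
  VaryingRowLayout layout;
  // Varying-length data starts after the fixed-length columns and the
  // 32-bit (4-byte) end offsets, one per varying-length column.
  int64_t cursor =
      fixed_length_bytes + 4 * static_cast<int64_t>(varying_lengths.size());
  for (int64_t length : varying_lengths) {
    cursor = AlignUp(cursor, string_alignment);  // pad before each column
    cursor += length;
    layout.end_offsets.push_back(static_cast<uint32_t>(cursor));
  }
  layout.row_length = AlignUp(cursor, row_alignment);  // pad the whole row
  return layout;
}
```

With 8 bytes of fixed-length columns, string lengths of 5 and 1 (``'Alice'`` and ``'x'``), and both alignments set to 8, this gives end offsets 21 and 25 and a row length of 32, matching row 0 above. Summing the lengths of the three rows (32, 32, and 40) gives 104, the offset of row 3.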
diff --git a/docs/source/developers/cpp/index.rst b/docs/source/developers/cpp/index.rst
index 603e1607dc..ec97d4a62a 100644
--- a/docs/source/developers/cpp/index.rst
+++ b/docs/source/developers/cpp/index.rst
@@ -30,3 +30,4 @@ C++ Development
emscripten
conventions
fuzzing
+ compute