Re: [PR] GH-41673: [Format][Docs] Add arrow format introductory page [arrow]

via GitHub Thu, 06 Jun 2024 01:09:01 -0700


jorisvandenbossche commented on code in PR #41593:
URL: https://github.com/apache/arrow/pull/41593#discussion_r1628894289



##########
docs/source/format/Intro.rst:
##########
@@ -242,7 +206,7 @@ are int64.
 
 .. figure:: ./images/var-list-diagram.svg
    :alt: Diagram is showing the difference between the variable size
-         list data type presented in a Table and the dataactually
+         list data type presented in a Table and the data actually

Review Comment:
   This change needs to be done for other alt texts as well



##########
docs/source/format/images/sparse-union-diagram.svg:
##########


Review Comment:
   I think in the new version, you indicated a null in the types buffer (with
   an underscore), but at that level (parent array), there is no validity bitmap,
   so there cannot be any null. A null in the logical array is AFAIK always a null
   of a specific type, i.e. a null in one of the child arrays. So the types buffer
   still needs to point to one of the children.
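
   To make the point concrete, here is a minimal pure-Python sketch of how a
   slot in a sparse union would be resolved. It is only an illustration, not
   an Arrow API: the function name `sparse_union_values` and the
   `(values, validity)` tuples are hypothetical stand-ins for the types buffer
   and the child arrays.

```python
# Hypothetical model of a sparse union; names are illustrative, not Arrow APIs.

def sparse_union_values(type_ids, children):
    """Resolve each logical slot of a sparse union.

    type_ids: one child index per slot (the types buffer, no validity bitmap).
    children: list of (values, validity) pairs, each as long as type_ids.
    """
    out = []
    for slot, child_idx in enumerate(type_ids):
        values, validity = children[child_idx]
        # Nullness comes from the selected child's validity, never from type_ids.
        out.append(values[slot] if validity[slot] else None)
    return out


# Child 0 holds integers (slot 1 is null); child 1 holds strings.
ints = ([5, 0, 0, 7], [1, 0, 0, 1])
strs = (["", "", "a", ""], [0, 0, 1, 0])
type_ids = [0, 0, 1, 0]  # slot 1 still points at child 0, whose slot 1 is null

result = sparse_union_values(type_ids, [ints, strs])
```

   Every entry of `type_ids` is a valid child index; the null in slot 1 only
   shows up through the validity of child 0.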



##########
docs/source/format/Intro.rst:
##########
@@ -57,58 +57,22 @@ are adjacent in the computer memory:
 
 In a columnar format, on the other hand, the data is organised by column
 instead of by row making analytical operations like filtering, grouping,
-aggregations and others much more efficient. CPU can maintain memory locality
+aggregations and others more efficient because the CPU can maintain memory locality
 and require less memory jumps to process the data. By keeping the data contiguous
 in memory it also enables vectorization of the computations. Most modern
 CPUs have single instructions, multiple data (SIMD) enabling parallel
-processing and execution of instructions on vector data in single CPU
-instructions.
+processing and execution of operations on vector data using a single CPU
+instruction.
 
 .. figure:: ./images/columnar-diagram_3.svg
    :alt: Tabular data being structured column by column in computer memory.
 
-Overview of Arrow Terminology
-=============================
-
-**Physical layout**
-A specification for how to arrange values of an array in memory.
-
-**Buffer**
-A contiguous region of memory with a given length. Buffers are used to store data for arrays.
-
-**Array**
-A contiguous, one-dimensional sequence of values with known length where all values have the
-same type. An array consists of zero or more buffers.
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.

Review Comment:
   ```suggestion
   The column is called an **array** in Arrow terminology. Arrays can be of
   different types and the way their values are stored in memory varies among
   types. The specification of how these values are arranged in memory is what we
   call a **physical memory layout**. One contiguous region of memory that stores
   data for arrays is called a **buffer**.
   ```
   
   I would use bold instead of monospace to highlight those new terms, because
   code formatting might incorrectly suggest this is an actual class or something
   (in a specific implementation it is, but not in the context of the spec).



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column

Review Comment:
   ```suggestion
   Data in a table can be represented in memory using a row based format or a column
   ```
   
   (depending on how you describe it above)



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1
+means that the value is not-null ("valid"), whereas a value of 0 indicates that the value
+is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are, for example, signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively
+in a single buffer or region of memory. To know where each element of the
+column starts and ends the physical layout also includes integer offsets.
+The number of elements of the offset buffer is one more than the length of the
+array as the last two elements define the start and the end of the last
+element in the binary/string column.
+
+Binary and string types share the same physical layout. The one difference
+between them is that the string type is utf-8 binary and will produce an
+invalid result if the bytes are not valid utf-8.
+
+The difference between binary/string and large binary/string is in the offset
+type. In the first case that is int32 and in the second it is int64.
+
+The limitation of types using 32 bit offsets is that they have a max size of
+2GB per array. One can still use the non-large variants for bigger data, but
+then multiple chunks are needed.
+
+.. figure:: ./images/var-string-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string data types.
+
+Variable length binary and string view
+--------------------------------------
+
+This layout is an alternative for the variable length binary layout and is adapted from TU Munich's `UmbraDB`_ and is similar to the string
+layout used in `DuckDB`_ and `Velox`_ (and sometimes also called "German style strings").
+
+.. _UmbraDB: https://umbra-db.com/
+.. _DuckDB: https://duckdb.com
+.. _Velox: https://velox-lib.io/
+The main differences to classical binary and string layout is the views buffer.

Review Comment:
   ```suggestion
   The main difference to the classical binary and string layout is the views buffer.
   ```



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+

Review Comment:
   At some point, we should also refer to the actual columnar specification. So
   maybe we could say something here that the next sections give an introduction
   to the Arrow Columnar Format explaining the different physical layouts, but that
   the full specification can be found at \<ref>



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,454 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others much more efficient. CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of instructions on vector data in single CPU
+instructions.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies between
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not. You can think of it as vector of 0 and 1 values, where a 1
+means that the value is not-null ("valid"), while a 0 indicates the value
+is null.
+
+This validity bitmap is optional, i.e. if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are for example signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no memory buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively
+in a single buffer or region of memory. To know where each element of the
+column starts and ends the physical layout also includes integer offsets.
+The length of the offset buffer is one more than the length of the values
+buffer as the last two elements define the start and the end of the last
+element in the binary/string column.
+
+Binary and string types share the same physical layout. The one difference
+between them is that the string type is utf-8 binary and will produce an
+invalid result if the bytes are not valid utf-8.
+
+The difference between binary/string and large binary/string is in the offset
+type. In the first case that is int32 and in the second it is int64.
+
+The limitation of types using 32 bit offsets is that they have a max size of
+2GB per array. One can still use the non-large variants for bigger data, but
+then multiple chunks are needed.
+
+.. figure:: ./images/var-string-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string data types.
+
+Variable length binary and string view
+--------------------------------------
+
+This layout is an alternative for the variable length binary layout and is adapted from TU Munich's `UmbraDB`_ and is similar to the string
+layout used in `DuckDB`_ and `Velox`_ (and sometimes also called "German style strings").
+
+.. _UmbraDB: https://umbra-db.com/
+.. _DuckDB: https://duckdb.com
+.. _Velox: https://velox-lib.io/
+The main differences to classical binary and string layout is the views buffer.
+It includes the length of the string, and then either contains the characters
+inline (for small strings) or only the first 4 bytes of the string and point to a location in one of
+potentially several data buffers. It also supports binary and strings to be written
+out of order.
+
+These properties are important for efficient string processing. The prefix
+enables a profitable fast path for string comparisons, which are frequently
+determined within the first four bytes. Selecting elements is a simple "take"
+operation on the fixed-width views buffer and does not need to rewrite the
+values buffers.
+
+.. figure:: ./images/var-string-view-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string view data type presented in a Table and the dataactually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string view data type.
+
+Nested layouts
+==============
+
+Nested types introduce the concept of parent and child arrays. They express
+relationships between physical value arrays in a nested type structure.
+
+Nested types depend on one or more other child data types. For instance, List
+is a nested type (parent) that has one child (the data types of the values in
+the list).
+
+List
+----
+
+The list type enables values of the same type being stacked together in a
+sequence of values in each column slot. The layout is similar to binary or

Review Comment:
   Or maybe something like "The list type enables representing an array where
   each element is a collection of elements of the same type"? (otherwise you
   could interpret the "collections" as the actual array, which is a bit confusing
   I think)
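
   A small pure-Python sketch of that reading, in case it helps with the
   wording: each element of the list array is itself a sequence, cut out of a
   single child values array by an offsets buffer. The helper name
   `list_array_values` and the plain Python lists standing in for buffers are
   assumptions for illustration only, not an Arrow API.

```python
# Hypothetical model of the variable-size list layout; not an Arrow API.

def list_array_values(offsets, child_values, validity=None):
    """Materialize each list element from offsets into the child values.

    offsets has one more entry than the list array has elements:
    element i spans child_values[offsets[i]:offsets[i + 1]].
    """
    length = len(offsets) - 1
    out = []
    for i in range(length):
        if validity is not None and not validity[i]:
            out.append(None)  # a null list element
        else:
            out.append(child_values[offsets[i]:offsets[i + 1]])
    return out


offsets = [0, 2, 2, 5]            # three list elements, the second is empty
child_values = [1, 2, 3, 4, 5]    # all elements share one child array
lists = list_array_values(offsets, child_values)
```

   Here the array has three elements, and each element is a collection of
   same-typed values taken from the shared child array.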



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1
+means that the value is not-null ("valid"), whereas a value of 0 indicates that the value
+is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are, for example, signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively
+in a single buffer or region of memory. To know where each element of the
+column starts and ends the physical layout also includes integer offsets.
+The number of elements of the offset buffer is one more than the length of the
+array as the last two elements define the start and the end of the last
+element in the binary/string column.
+
+Binary and string types share the same physical layout. The one difference
+between them is that the string type is utf-8 binary and will produce an
+invalid result if the bytes are not valid utf-8.
+
+The difference between binary/string and large binary/string is in the offset
+type. In the first case that is int32 and in the second it is int64.
+
+The limitation of types using 32 bit offsets is that they have a max size of
+2GB per array. One can still use the non-large variants for bigger data, but
+then multiple chunks are needed.
+
+.. figure:: ./images/var-string-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string data types.
+
+Variable length binary and string view
+--------------------------------------
+
+This layout is an alternative for the variable length binary layout and is adapted from TU Munich's `UmbraDB`_ and is similar to the string
+layout used in `DuckDB`_ and `Velox`_ (and sometimes also called "German style strings").
+
+.. _UmbraDB: https://umbra-db.com/
+.. _DuckDB: https://duckdb.com
+.. _Velox: https://velox-lib.io/
+The main differences to classical binary and string layout is the views buffer.
+It includes the length of the string, and then either contains the characters
+inline (for small strings) or only the first 4 bytes of the string and an offset into one of
+potentially several data buffers. It also supports binary and strings to be written
+out of order.
+
+These properties are important for efficient string processing. The prefix
+enables a profitable fast path for string comparisons, which are frequently
+determined within the first four bytes. Selecting elements is a simple "take"
+operation on the fixed-width views buffer and does not need to rewrite the
+values buffers.
+
+.. figure:: ./images/var-string-view-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string view data type presented in a Table and the dataactually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string view data type.
+
+Nested layouts
+==============
+
+Nested types introduce the concept of parent and child arrays. They express
+relationships between physical value arrays in a nested type structure.
+
+Nested types depend on one or more other child data types. For instance, List
+is a nested type (parent) that has one child (the data types of the values in
+the list).
+
+List
+----
+
+The list type enables values of the same type being stacked together in a
+sequence of values in each column slot. The layout is similar to binary or
+string type as it has offsets buffer to define where the sequence of values

Review Comment:
   ```suggestion
   sequence of values in each column slot. The layout is similar to variable-size binary or
   string layout as it has offsets buffer to define where the sequence of values
   ```
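
   As an illustration of why the comparison to the variable-size binary layout holds: in both layouts an offsets buffer with one more entry than the array length delimits each slot. Below is a minimal pure-Python sketch, not Arrow API; the helper name `list_slots` and the sample values are invented for illustration, and real Arrow buffers are raw bytes rather than Python lists.

```python
def list_slots(offsets, child_values):
    """Slice a flat child array into per-slot lists using list offsets."""
    # Slot i covers child_values[offsets[i]:offsets[i + 1]].
    return [child_values[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]


# Logical column: [[1, 2], [], [3, 4, 5]]
offsets = [0, 2, 2, 5]          # one more entry than the number of slots
child_values = [1, 2, 3, 4, 5]  # child array holding all list elements
slots = list_slots(offsets, child_values)
```

   An empty list is simply two equal consecutive offsets, exactly as an empty string would be in the binary/string layout.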



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.

Review Comment:
   I know the title of this page and section says "Columnar Format", but I think it would be good to explicitly call out that Arrow is a specification using this columnar layout.



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1
+means that the value is not-null ("valid"), whereas a value of 0 indicates that the value
+is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are, for example, signed and unsigned

Review Comment:
   ```suggestion
   has the same physical size measured in bytes. Data types that use the
   fixed size primitive layout are, for example, signed and unsigned
   ```
   
   (share the "same" might be incorrectly interpreted as _exactly_ the same, while the bitwidth can still vary?)
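
   A small sketch of the bit width point, using Python's standard `struct` module as a stand-in for a values buffer (an illustration only, not Arrow API; the helper `value_at` is hypothetical): the layout is the same, one fixed-size slot per value, while the slot width depends on the data type.

```python
import struct

values = [1, 2, 3]

# Same fixed-size primitive layout, different width per value.
int32_buffer = struct.pack("<3i", *values)  # 4 bytes per value
int64_buffer = struct.pack("<3q", *values)  # 8 bytes per value


def value_at(buffer, index, fmt):
    """Read slot `index` by computing its byte position from the slot width."""
    width = struct.calcsize(fmt)
    return struct.unpack_from(fmt, buffer, index * width)[0]
```

   Reading element `i` is the same arithmetic in both cases (`i * width`); only `width` changes between int32 and int64.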



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+

Review Comment:
   I would try to add an introductory sentence here before showing the image (because now the first text after the image starts with "Data can be represented ...", but which / what kind of data?)
   
   Something like "Apache Arrow focuses on tabular data. Consider the following table:" (although that is maybe still not enough content to warrant a line ;))



##########
docs/source/format/images/dense-union-diagram.svg:
##########


Review Comment:
   I think the validity buffer for child 0 is wrong? The child array has only 3 elements, but the two `1`s in the bitmap are too far apart (more than 3 positions).
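
   For reference, a rough sketch of the constraint described here: in a dense union each child array carries its own validity bitmap with one bit per child element, so a child with 3 elements has only 3 meaningful bits. The sample data and helper names below are invented for illustration and do not correspond to the values in the SVG.

```python
def is_valid(bitmap, i):
    """Validity bitmaps use LSB-first bit order: bit i lives in byte i // 8."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


# Hypothetical dense union with two children.
type_ids = [0, 1, 0, 0, 1]  # which child each slot belongs to
offsets = [0, 0, 1, 2, 1]   # position of the slot's value inside that child
children = {
    # 3 elements, so only bits 0..2 matter; the middle element is null.
    0: {"values": [10, 0, 30], "validity": bytes([0b00000101])},
    # 2 elements, both valid.
    1: {"values": ["a", "b"], "validity": bytes([0b00000011])},
}


def union_value(slot):
    """Resolve a union slot through its type id and offset; None means null."""
    child = children[type_ids[slot]]
    pos = offsets[slot]
    return child["values"][pos] if is_valid(child["validity"], pos) else None


logical = [union_value(s) for s in range(len(type_ids))]
```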



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1
+means that the value is not-null ("valid"), whereas a value of 0 indicates that the value
+is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are, for example, signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively

Review Comment:
   ```suggestion
   The bytes of all elements in a binary or string column are stored together consecutively
   ```
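
   To make "the bytes of all elements" concrete, here is a minimal pure-Python sketch of the data and offsets buffers for a string column. The function names are hypothetical, offsets are kept as a Python list rather than an int32 buffer, and nulls are left out for brevity.

```python
def encode_strings(strings):
    """Build the data and offsets buffers of a variable-length string column."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this element, start of the next
    return bytes(data), offsets


def decode_string(data, offsets, i):
    """Element i is the byte range between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = encode_strings(["apple", "", "kiwi"])
```

   All element bytes end up back to back in one buffer, and the offsets buffer has one more entry than the column has elements.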



##########
docs/source/format/Intro.rst:
##########
@@ -57,58 +57,22 @@ are adjacent in the computer memory:
 
 In a columnar format, on the other hand, the data is organised by column
 instead of by row making analytical operations like filtering, grouping,
-aggregations and others much more efficient. CPU can maintain memory locality
+aggregations and others more efficient because the CPU can maintain memory locality
 and require less memory jumps to process the data. By keeping the data contiguous
 in memory it also enables vectorization of the computations. Most modern
 CPUs have single instructions, multiple data (SIMD) enabling parallel
-processing and execution of instructions on vector data in single CPU
-instructions.
+processing and execution of operations on vector data using a single CPU
+instruction.
 
 .. figure:: ./images/columnar-diagram_3.svg
    :alt: Tabular data being structured column by column in computer memory.
 
-Overview of Arrow Terminology
-=============================
-
-**Physical layout**
-A specification for how to arrange values of an array in memory.
-
-**Buffer**
-A contiguous region of memory with a given length. Buffers are used to store data for arrays.
-
-**Array**
-A contiguous, one-dimensional sequence of values with known length where all values have the
-same type. An array consists of zero or more buffers.
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.

Review Comment:
   Maybe also use "data type" (instead of just "type") here, to follow the rename from logical type to data type.



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,458 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies among
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1
+means that the value is not-null ("valid"), whereas a value of 0 indicates that the value
+is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are, for example, signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively
+in a single buffer or region of memory. To know where each element of the
+column starts and ends the physical layout also includes integer offsets.
+The number of elements of the offset buffer is one more than the length of the
+array as the last two elements define the start and the end of the last
+element in the binary/string column.
+
+Binary and string types share the same physical layout. The one difference
+between them is that the string type is utf-8 binary and will produce an
+invalid result if the bytes are not valid utf-8.
+
+The difference between binary/string and large binary/string is in the offset
+type. In the first case that is int32 and in the second it is int64.
+
+The limitation of types using 32 bit offsets is that they have a max size of
+2GB per array. One can still use the non-large variants for bigger data, but
+then multiple chunks are needed.
+
+.. figure:: ./images/var-string-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string data types.
+
+Variable length binary and string view
+--------------------------------------
+
+This layout is an alternative for the variable length binary layout and is adapted from TU Munich's `UmbraDB`_ and is similar to the string
+layout used in `DuckDB`_ and `Velox`_ (and sometimes also called "German style strings").
+
+.. _UmbraDB: https://umbra-db.com/
+.. _DuckDB: https://duckdb.com
+.. _Velox: https://velox-lib.io/
+The main differences to classical binary and string layout is the views buffer.
+It includes the length of the string, and then either contains the characters
+inline (for small strings) or only the first 4 bytes of the string and an offset into one of
+potentially several data buffers. It also supports binary and strings to be written
+out of order.

Review Comment:
   ```suggestion
   potentially several data buffers. Because it uses an offset and length to refer to the data buffer, the bytes of all elements do not need to be stored together consecutively in one buffer, and thus it supports the bytes to be written
   out of order.
   ```
   
   Trying to clarify the big difference of not having one contiguous + consecutive data buffer.
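
   A sketch of what this means in practice, assuming the 16-byte view described in the columnar format specification (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and an offset). The helpers `make_view` and `read_view` and the sample data are hypothetical, and the code is plain Python with `struct`, not Arrow API.

```python
import struct


def make_view(value, buffer_index=0, offset=0):
    """Pack one 16-byte view for `value` (a bytes object)."""
    if len(value) <= 12:
        # Short value: length followed by the bytes inline, zero-padded.
        return struct.pack("<i12s", len(value), value)
    # Long value: length, 4-byte prefix, data buffer index, offset in that buffer.
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def read_view(view, data_buffers):
    """Resolve a view to its bytes, inline or via (buffer index, offset)."""
    length = struct.unpack_from("<i", view, 0)[0]
    if length <= 12:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return data_buffers[buffer_index][offset:offset + length]


# Two data buffers; the long values are referenced in a different order
# than they are stored, which the offsets-based layout cannot express.
data_buffers = [b"a considerably longer string", b"another long string value"]
views = [
    make_view(b"short"),
    make_view(data_buffers[1], buffer_index=1, offset=0),
    make_view(data_buffers[0], buffer_index=0, offset=0),
]
decoded = [read_view(v, data_buffers) for v in views]
```

   Reordering or selecting elements only touches the fixed-width views; the data buffers stay as they are.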



##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,454 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This includes such topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row, making analytical operations like filtering, grouping,
+aggregations and others more efficient because the CPU can maintain memory
+locality and requires fewer memory jumps to process the data. Keeping the
+data contiguous in memory also enables vectorization of the computations.
+Most modern CPUs have single instruction, multiple data (SIMD) instructions,
+enabling parallel processing and execution of operations on vector data
+using a single CPU instruction.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+The column is called an ``Array`` in Arrow terminology. Arrays can be of
+different types and the way their values are stored in memory varies between
+types. The specification of how these values are arranged in memory is what we
+call a ``physical memory layout``. One contiguous region of memory that stores
+data for arrays is called a ``Buffer``.
+
+
+Support for null values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is stored alongside the data and indicates whether each value in the array
+is null or not. You can think of it as a vector of 0 and 1 values, where a 1
+means that the value is not null ("valid"), while a 0 indicates that the
+value is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array, the buffer does not need to be allocated (as in column 1 of
+the example in the diagram below).
+
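A minimal sketch of the validity bitmap described above, written in plain Python rather than with any Arrow library. The helper names `pack_validity` and `is_valid` are hypothetical, and the bit order (least-significant bit first within each byte) is taken from the Arrow columnar specification, not from the text of this page.

```python
def pack_validity(values):
    """Pack a list of optional values into a validity bitmap (LSB first)."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per value, rounded up
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # set bit i: the value is valid
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if slot i is marked as not null."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


column = [1, None, 3, 4, None]
bitmap = pack_validity(column)
```

Five values fit in a single byte here; slots 1 and 4 have their bits left at 0 and therefore read back as null.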
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are for example signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
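As an illustration of the fixed size primitive layout, the following sketch stores four int32 values in one contiguous buffer using Python's `struct` module. The little-endian byte order and the helper name `get_value` are assumptions made for the example; the point is only that value `i` lives at byte offset `i * width`.

```python
import struct

values = [1, 2, 3, 4]
# Four little-endian int32 values packed back to back: 4 * 4 = 16 bytes.
buffer = struct.pack("<4i", *values)


def get_value(buffer, i, width=4):
    """Read the i-th int32 directly from its computed byte offset."""
    return struct.unpack_from("<i", buffer, i * width)[0]
```

Because every value has the same width, random access needs no lookup structure beyond the buffer itself.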
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   Boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null type where all values are null. In
+   this case no memory buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+The bytes of a binary or string column are stored together consecutively
+in a single buffer or region of memory. To know where each element of the
+column starts and ends, the physical layout also includes integer offsets.
+The offsets buffer always has one more entry than the number of elements
+in the column, because the last two offsets define the start and the end
+of the last element in the binary/string column.
+
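The relationship between the offsets and the data buffer can be sketched in plain Python as follows. This is a simplified illustration, not an Arrow API: `encode_strings` and `get_string` are hypothetical helpers, and nulls are left out.

```python
def encode_strings(values):
    """Build an offsets list and one contiguous data buffer from strings."""
    offsets = [0]
    data = bytearray()
    for s in values:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this element, start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Element i spans the bytes between offsets[i] and offsets[i + 1]."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = encode_strings(["arrow", "", "format"])
```

Three elements produce four offsets, and an empty string simply repeats the previous offset.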
+Binary and string types share the same physical layout. The one difference
+between them is that the string type holds UTF-8 encoded binary data and will
+produce an invalid result if the bytes are not valid UTF-8.
+
+The difference between binary/string and large binary/string is in the offset
+type. In the first case that is int32 and in the second it is int64.
+
+The limitation of types using 32 bit offsets is that they have a max size of
+2GB per array. One can still use the non-large variants for bigger data, but
+then multiple chunks are needed.
+
+.. figure:: ./images/var-string-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string data types.
+
+Variable length binary and string view
+--------------------------------------
+
+This layout is an alternative to the variable length binary layout. It is
+adapted from TU Munich's `UmbraDB`_ and is similar to the string layout used
+in `DuckDB`_ and `Velox`_ (it is sometimes also called "German style strings").
+
+.. _UmbraDB: https://umbra-db.com/
+.. _DuckDB: https://duckdb.com
+.. _Velox: https://velox-lib.io/
+
+The main difference from the classical binary and string layout is the views
+buffer. Each view includes the length of the string, and then either contains
+the characters inline (for small strings) or contains only the first 4 bytes
+of the string and points to a location in one of potentially several data
+buffers. The layout also supports binary and string values being written out
+of order.
+
+These properties are important for efficient string processing. The prefix
+enables a profitable fast path for string comparisons, which are frequently
+determined within the first four bytes. Selecting elements is a simple "take"
+operation on the fixed-width views buffer and does not need to rewrite the
+values buffers.
+
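A rough sketch of a single view entry, using Python's `struct` module. The 16-byte view size, the 12-byte inline threshold, and the field order (length, 4-byte prefix, buffer index, offset) are taken from the Arrow columnar specification rather than from this page, and `make_view` / `read_view` are hypothetical helpers, so treat this as an illustration only.

```python
import struct

INLINE_MAX = 12  # assumed threshold: longer values go to a data buffer


def make_view(raw, buffers):
    """Encode one value as a 16-byte view, appending long values to the last buffer."""
    if len(raw) <= INLINE_MAX:
        # length + up to 12 inline bytes ("12s" pads with zero bytes)
        return struct.pack("<i12s", len(raw), raw)
    buffer_index = len(buffers) - 1
    offset = len(buffers[buffer_index])
    buffers[buffer_index] += raw
    # length + 4-byte prefix + buffer index + offset into that buffer
    return struct.pack("<i4sii", len(raw), raw[:4], buffer_index, offset)


def read_view(view, buffers):
    """Decode a view back into the bytes it refers to."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return bytes(buffers[buffer_index][offset:offset + length])


buffers = [bytearray()]
views = [make_view(s, buffers) for s in (b"short", b"a considerably longer string")]
```

Every view has the same width regardless of the string length, which is why selecting elements only needs to copy views and can leave the data buffers untouched.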
+.. figure:: ./images/var-string-view-diagram.svg
+   :alt: Diagram is showing the difference between the variable length
+         string view data type presented in a Table and the data actually
+         stored in computer memory.
+
+   Physical layout diagram for variable length string view data type.
+
+Nested layouts
+==============
+
+Nested types introduce the concept of parent and child arrays. They express
+relationships between physical value arrays in a nested type structure.
+
+Nested types depend on one or more child data types. For instance, List
+is a nested type (parent) that has one child (the data type of the values in
+the list).
+
+List
+----
+
+The list type enables values of the same type being stacked together in a
+sequence of values in each column slot. The layout is similar to binary or

Review Comment:
   Could also use "sequence" instead of "collection" as is used in the 
Fixed-size list section below (or update it there to use "collection")



