jorisvandenbossche commented on code in PR #41593: URL: https://github.com/apache/arrow/pull/41593#discussion_r1637702896
########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. Review Comment: Two comments here (cc @felipecrv as I think you suggested this paragraph): - If we mention inter-process communication here, should we also mention zero-copy within-processing sharing? (i.e. what the C Data Interface provides). Also, I know we generally say about the IPC protocol that it is zero copy, but of course it's not entirely zero copy, so mentioning it twice in context of IPC is maybe a bit too much - I find the mention of "more efficient reading and writing of file formats" a bit out of place now, because it's not really the format itself that enables to read eg a Parquet file more efficiently? (it's that many Arrow implementation will provide this functionality as well, and that we can more easily reuse such implementations if they read into Arrow format) ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. Review Comment: Maybe you can add a caption here as well for those figures (like you do for the ones below) ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being Review Comment: ```suggestion `least-significant bit numbering <https://en.wikipedia.org/wiki/Bit_numbering>`_ being ``` ("bit-endianness" is the generic term about the concept, but doesn't say if its then little endian or big endian) ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string Review Comment: The format spec actually uses "variable-size" instead of "length", so should probably also stick to that terminology to be consistent. ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list data type enables representing an array where each element is a sequence +of elements of the same data type. The layout is similar to variable-size binary +or string layout as it has an offsets buffer to define where the sequence of values +for each element starts and ends, with all the values being stored consecutively +in a values child array. + +The offsets in the list data type are int32 while in the large list the offsets +are int64. + +.. figure:: ./images/var-list-diagram.svg + :alt: Diagram is showing the difference between the variable size + list data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable size list data type. + +Fixed size list +--------------- + +Fixed size list is a special case of variable-size list where each column slot +contains a fixed size sequence meaning all lists are the same size and so the +offset buffer is no longer needed. + +.. figure:: ./images/fixed-list-diagram.svg + :alt: Diagram is showing the difference between the fixed size list data + type presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for fixed size list data type. + +List and large list view +------------------------ + +List view data type allows arrays to specify out-of-order offsets. Review Comment: Can you expand this explanation a bit? (I think the main point is that in addition to the offsets buffer, there is now also a sizes buffer. The offsets still indicate the start of each element, but the size is not inferred from the next offset value, but now coded explicitly in a separate sizes buffer. That allows to have out-of-order offsets. ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of Review Comment: ```suggestion The limitation of data types using 32 bit offsets is that they have a maximum size of ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. Review Comment: I would explicitly call out here that this is depicted that way in the diagrams ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the Review Comment: ```suggestion The boolean data type is represented with a primitive layout where the ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list data type enables representing an array where each element is a sequence +of elements of the same data type. The layout is similar to variable-size binary +or string layout as it has an offsets buffer to define where the sequence of values +for each element starts and ends, with all the values being stored consecutively +in a values child array. + +The offsets in the list data type are int32 while in the large list the offsets +are int64. + +.. figure:: ./images/var-list-diagram.svg + :alt: Diagram is showing the difference between the variable size + list data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable size list data type. + +Fixed size list +--------------- + +Fixed size list is a special case of variable-size list where each column slot Review Comment: ```suggestion Fixed-size list is a special case of variable-size list where each column slot ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together Review Comment: The previous section for fixed-size primitive starts with > A primitive column represents an array of values where each value has the same physical size measured in bytes. so you start here in a similar way, something like "In contrast to the fixed-size layout, the variable-size layout allows to represent an array where each element can have a variable size in bytes. This layout is used for binary and string data." Then it's also clearer what "the bytes of all elements" are exactly referring to. ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. Review Comment: I would add here something like (to connect the array with buffer concepts): "An array consists of one or more buffers" ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in Review Comment: ```suggestion is a nested data type (parent) that has one child (the data type of the values in ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values Review Comment: It's indeed very inconsistent... I think that many newer parts of the docs actually use sentence-case (my personal preference if for sentence-case, but foremost we should be consistent) ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list data type enables representing an array where each element is a sequence +of elements of the same data type. The layout is similar to variable-size binary +or string layout as it has an offsets buffer to define where the sequence of values +for each element starts and ends, with all the values being stored consecutively +in a values child array. + +The offsets in the list data type are int32 while in the large list the offsets +are int64. + +.. figure:: ./images/var-list-diagram.svg + :alt: Diagram is showing the difference between the variable size + list data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable size list data type. + +Fixed size list +--------------- + +Fixed size list is a special case of variable-size list where each column slot +contains a fixed size sequence meaning all lists are the same size and so the +offset buffer is no longer needed. + +.. figure:: ./images/fixed-list-diagram.svg + :alt: Diagram is showing the difference between the fixed size list data + type presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for fixed size list data type. + +List and large list view +------------------------ + +List view data type allows arrays to specify out-of-order offsets. + +.. figure:: ./images/var-list-view-diagram.svg + :alt: Diagram is showing the difference between the variable size list view + data type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for variable size list view data type. + +Struct +------ + +A struct is a nested data type parameterized by an ordered sequence of data types. Review Comment: ```suggestion A struct is a nested data type parameterized by an ordered sequence of fields (a data types and a name). ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list data type enables representing an array where each element is a sequence +of elements of the same data type. The layout is similar to variable-size binary +or string layout as it has an offsets buffer to define where the sequence of values +for each element starts and ends, with all the values being stored consecutively +in a values child array. + +The offsets in the list data type are int32 while in the large list the offsets +are int64. + +.. figure:: ./images/var-list-diagram.svg + :alt: Diagram is showing the difference between the variable size + list data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable size list data type. + +Fixed size list +--------------- + +Fixed size list is a special case of variable-size list where each column slot +contains a fixed size sequence meaning all lists are the same size and so the +offset buffer is no longer needed. + +.. figure:: ./images/fixed-list-diagram.svg + :alt: Diagram is showing the difference between the fixed size list data + type presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for fixed size list data type. + +List and large list view +------------------------ + +List view data type allows arrays to specify out-of-order offsets. + +.. figure:: ./images/var-list-view-diagram.svg + :alt: Diagram is showing the difference between the variable size list view + data type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for variable size list view data type. + +Struct +------ + +A struct is a nested data type parameterized by an ordered sequence of data types. + +* There is one child array for each field +* Child arrays are independent and need not be adjacent to each other in + memory (only need to have the same length) + +One can think of an individual struct field as a key-value pair where the +key is the field name and the child array its values. The field (key) is +saved in the schema and the values of a specific field (key) are saved in +the child array. + +.. figure:: ./images/struct-diagram.svg + :alt: Diagram is showing the difference between the struct data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for struct data type. + +Map +--- + +The Map data type represents nested data where each value is a variable number of +key-value pairs. Its physical representation is the same as a list of ``{key, value}`` +structs. + +The difference between the struct and map data types is that a struct holds the key +in the schema, requiring keys to be strings, and the values are stored in in the Review Comment: ```suggestion in the schema, requiring keys to be strings, and the values are stored in the ``` ########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,485 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +************ +Introduction +************ + +Apache Arrow was born from the need for a set of standards around +tabular data representation and interchange between systems. +The adoption of these standards reduce computing costs of data +serialization/deserialization and implementation costs across +systems implemented in different programming languages. + +The Apache Arrow specification can be implemented in any programming +language but official implementations for many languages are available. +An implementation consists of format definitions using the constructs +offered by the language and common in-memory data processing algorithms +(e.g. slicing and concatenating). Users can extend and use the utilities +provided by the Apache Arrow implementation in their programming +language of choice. Some implementations are further ahead and feature a +vast set of algorithms for in-memory analytical data processing. + +As the format gets more adoption, it becomes easier for data processing +systems to exchange tabular data. Among other things, an agreed upon +in-memory format, enables the implementations of zero-copy IPC protocols +(inter-process communication without copying data in memory) and +more efficient reading and writing of file formats like CSV, `Apache ORC`_, +and `Apache Parquet`_. + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +Apache Arrow focuses on tabular data so let's consider we have data +which are tabular so they can be organized into a table: + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +This kind of data can be represented in memory using a row based format or a +column based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organized by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others more efficient because the CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of operations on vector data using a single CPU +instruction. + +Apache Arrow is solving this exact problem. It is the specification that +uses the columnar layout. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an **Array** in Arrow terminology. Arrays can be of +different data types and the way their values are stored in memory varies among +the data types. The specification of how these values are arranged in memory is +what we call a **physical memory layout**. One contiguous region of memory that +stores data for arrays is called a **Buffer**. + +Next sections give an introduction to Arrow Columnar Format explaining the +different physical layouts. The full specification of the format can be found +at :ref:`format_columnar`. + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested data type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not: a value of 1 means that the value is not-null ("valid"), whereas +a value of 0 indicates that the value is null. + +This validity bitmap is optional: if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +.. note:: + + We read validity bitmaps right-to-left within a group of 8 bits due to + `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being + used. + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that use the +fixed size primitive layout are, for example, signed and unsigned +integer data types, floating point numbers, boolean, decimal and temporal +data types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null data type where all values are null. In + this case no buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of all elements in a binary or string column are stored together +consecutively in a single buffer or region of memory. To know where each element +of the column starts and ends the physical layout also includes integer offsets. +The number of elements of the offset buffer is one more than the length of the +array as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string data types share the same physical layout. The one difference +between them is that the string data type is utf-8 binary and assumes to contain +utf-8 encoded strings. + +The difference between binary/string and large binary/string is in the offset +data type. In the first case that is int32 and in the second it is int64. + +The limitation of data types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted +from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and +`Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to the classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and an offset into +one of potentially several data buffers. Because it uses an offset and length to refer +to the data buffer, the bytes of all elements do not need to be stored together +consecutively in one buffer, and thus it supports the bytes to be written out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested data types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested data type structure. + +Nested data types depend on one or more other child data types. For instance, List +is a nested data type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list data type enables representing an array where each element is a sequence +of elements of the same data type. The layout is similar to variable-size binary +or string layout as it has an offsets buffer to define where the sequence of values +for each element starts and ends, with all the values being stored consecutively +in a values child array. + +The offsets in the list data type are int32 while in the large list the offsets +are int64. + +.. figure:: ./images/var-list-diagram.svg + :alt: Diagram is showing the difference between the variable size + list data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable size list data type. + +Fixed size list +--------------- + +Fixed size list is a special case of variable-size list where each column slot +contains a fixed size sequence meaning all lists are the same size and so the +offset buffer is no longer needed. + +.. figure:: ./images/fixed-list-diagram.svg + :alt: Diagram is showing the difference between the fixed size list data + type presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for fixed size list data type. + +List and large list view +------------------------ + +List view data type allows arrays to specify out-of-order offsets. + +.. figure:: ./images/var-list-view-diagram.svg + :alt: Diagram is showing the difference between the variable size list view + data type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for variable size list view data type. + +Struct +------ + +A struct is a nested data type parameterized by an ordered sequence of data types. + +* There is one child array for each field +* Child arrays are independent and need not be adjacent to each other in + memory (only need to have the same length) + +One can think of an individual struct field as a key-value pair where the +key is the field name and the child array its values. The field (key) is +saved in the schema and the values of a specific field (key) are saved in +the child array. + +.. figure:: ./images/struct-diagram.svg + :alt: Diagram is showing the difference between the struct data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for struct data type. + +Map +--- + +The Map data type represents nested data where each value is a variable number of +key-value pairs. Its physical representation is the same as a list of ``{key, value}`` +structs. + +The difference between the struct and map data types is that a struct holds the key +in the schema, requiring keys to be strings, and the values are stored in in the +child arrays, +one for each field. There can be multiple keys and therefore multiple child arrays. +The map, on the other hand, has one child array holding all the different keys (that +thus all need to be of the same data type, but not necessarily strings) and a second +child array holding all the values. The values need to be of the same data type; however, +the data type doesn't have to match that of the keys. + +Also, the map stores the struct in a list and needs an offset as the list is +variable shape. + +.. figure:: ./images/map-diagram.svg + :alt: Diagram is showing the difference between the map data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for map data type. + +Union +----- + +The union is a nested data type where each slot in the union has a value with a data type +chosen from a subset of possible Arrow data types. That means that a union array represents +a mixed-type array. Unlike other data types, unions do not have their own validity bitmap +and the nullness is determined by the child arrays. + +Arrow defines two distinct union data types, "dense" and "sparse". + +Dense Union +^^^^^^^^^^^ + +A Dense Union has one child array for each data type present in the mixed-type array and +two buffers of its own: + +* **Types buffer:** holds data type id for each slot of the array. Data type id is + frequently the index of the child array; however, the relationship between data type + ID and the child index is a parameter of the data type. +* **Offsets buffer:** holds relative offset into the respective child array for each + array slot. + +.. figure:: ./images/dense-union-diagram.svg + :alt: Diagram is showing the difference between the dense union data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for dense union data type. + +Sparse union +^^^^^^^^^^^^ + +A sparse union has the same structure as a dense union, with the omission of the offsets +buffer. In this case, the child arrays are each equal in length to the length of the union. + + +.. figure:: ./images/sparse-union-diagram.svg + :alt: Diagram is showing the difference between the sparse union data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for sparse union data type. + +Dictionary Encoded Layout +========================= + +Dictionary encoding can be effective when one has data with many repeated values. +The values are represented by integers referencing a dictionary usually consisting of +unique values. + +.. figure:: ./images/dictionary-diagram.svg + :alt: Diagram is showing the difference between the dictionary data type + presented in a Table and the data actually stored in computer + memory. + + Physical layout diagram for dictionary data type. + +Run-End Encoded Layout +====================== + +Run-end encoding is well-suited for representing data containing sequences of the +same value. These sequences are called runs. Run-end encoded array has no buffers +of its own, but has two child arrays: + +* **Run ends array:** holds the index in the array where each run ends. The run ends + array always begins with 0 and contains one more element than the length of + its parent array. Review Comment: ```suggestion * **Run ends array:** holds the index in the array where each run ends. The run ends array always begins with 0 and contains one more element than the length of its parent array. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
