amoeba commented on code in PR #41593:
URL: https://github.com/apache/arrow/pull/41593#discussion_r1775897665

##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,511 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+************
+Introduction
+************
+
+Apache Arrow was born from the need for a set of standards around
+tabular data representation and interchange between systems.
+The adoption of these standards reduce computing costs of data

Review Comment:
   ```suggestion
   The adoption of these standards reduces computing costs of data
   ```

##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,511 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+************
+Introduction
+************
+
+Apache Arrow was born from the need for a set of standards around
+tabular data representation and interchange between systems.
+The adoption of these standards reduce computing costs of data
+serialization/deserialization and implementation costs across
+systems implemented in different programming languages.
+
+The Apache Arrow specification can be implemented in any programming
+language but official implementations for many languages are available.
+An implementation consists of format definitions using the constructs
+offered by the language and common in-memory data processing algorithms
+(e.g. slicing and concatenating). Users can extend and use the utilities
+provided by the Apache Arrow implementation in their programming
+language of choice. Some implementations are further ahead and feature a
+vast set of algorithms for in-memory analytical data processing.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This covers topics like:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+Apache Arrow focuses on tabular data. For an example, let's consider
+we have data that can be organized into a table:
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+   Diagram of a tabular data structure.
+
+Tabular data can be represented in memory using a row-based format or a
+column-based format. The row-based format stores data row-by-row, meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+   Tabular data being saved in memory row by row.
+
+In a columnar format, the data is organized column-by-column instead.
+This organization makes analytical operations like filtering, grouping,
+aggregations and others, more efficient thanks to memory locality.
+When processing the data, the memory locations accessed by the CPU tend
+be near one another. By keeping the data contiguous in memory, it also
+enables vectorization of the computations. Most modern
+CPUs have
+[SIMD instructions](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)
+(a single instruction that operates on multiple values at once) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.
+
+Apache Arrow is solving this exact problem. It is the specification that
+uses the columnar layout.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+   The same tabular data being saved in memory column by column.
+
+Each column is called an **Array** in Arrow terminology. Arrays can be of
+different data types and the way their values are stored in memory varies among
+the data types. The specification of how these values are arranged in memory is
+what we call a **physical memory layout**. One contiguous region of memory that
+stores data for arrays is called a **Buffer**. An array consists of one or more
+buffers.
+
+Next sections give an introduction to Arrow Columnar Format explaining the
+different physical layouts. The full specification of the format can be found
+at :ref:`format_columnar`.
+
+Support for Null Values
+=======================
+
+Arrow supports missing values or "nulls" for all data types: any value
+in an array may be semantically null, whether primitive or nested data type.
+
+In Arrow, a dedicated buffer, known as the validity (or "null") bitmap,
+is used alongside the data indicating whether each value in the array is
+null or not: a value of 1 means that the value is not-null ("valid"), whereas
+a value of 0 indicates that the value is null.
+
+This validity bitmap is optional: if there are no missing values in
+the array the buffer does not need to be allocated (as in the example
+column 1 in the diagram below).
+
+.. note::
+
+   We read validity bitmaps right-to-left within a group of 8 bits due to
+   `least-significant bit numbering <https://en.wikipedia.org/wiki/Bit_numbering>`_
+   being used.
+
+   This is also the how we represented the validity bitmaps in the
+   diagrams included in this document.
+
+Primitive Layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that use the
+fixed size primitive layout are, for example, signed and unsigned
+integer data types, floating point numbers, boolean, decimal and temporal
+data types.
+
+.. figure:: ./images/primitive-diagram.svg
+   :alt: Diagram is showing the difference between the primitive data
+         type presented in a Table and the data actually stored in
+         computer memory.
+
+   Physical layout diagram for primitive data types.
+
+.. note::
+   The boolean data type is represented with a primitive layout where the
+   values are encoded in bits instead of bytes. That means the physical
+   layout includes a values bitmap buffer and possibly a validity bitmap
+   buffer.
+
+   .. figure:: ./images/bool-diagram.svg
+      :alt: Diagram is showing the difference between the boolean data
+            type presented in a Table and the data actually stored in
+            computer memory.
+
+      Physical layout diagram for boolean data type.
+
+.. note::
+   Arrow also has a concept of Null data type where all values are null. In
+   this case no buffers are allocated.
+
+Variable length binary and string
+---------------------------------
+
+In contrast to the fixed size primitive layout, the variable length layout
+allows representing an array where each element can have a variable size
+in bytes. This layout is used for binary and string data.
+
+The bytes of all elements in a binary or string column are stored together
+consecutively in a single buffer or region of memory. To know where each element
+of the column starts and ends, the physical layout also includes integer offsets.
+The offsets buffer is always one element longer than the array.
+The last two offsets define the start and the end of the last
+binary/string element.
+
+Binary and string data types share the same physical layout. The only
+difference between them is that a string-typed array is assumed to contain
+valid utf-8 string data.

Review Comment:
   ```suggestion
   valid UTF-8 string data.
   ```

##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,511 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+************
+Introduction
+************
+
+Apache Arrow was born from the need for a set of standards around
+tabular data representation and interchange between systems.
+The adoption of these standards reduce computing costs of data
+serialization/deserialization and implementation costs across
+systems implemented in different programming languages.
+
+The Apache Arrow specification can be implemented in any programming
+language but official implementations for many languages are available.
+An implementation consists of format definitions using the constructs
+offered by the language and common in-memory data processing algorithms
+(e.g. slicing and concatenating). Users can extend and use the utilities
+provided by the Apache Arrow implementation in their programming
+language of choice. Some implementations are further ahead and feature a
+vast set of algorithms for in-memory analytical data processing.

Review Comment:
   ```suggestion
   vast set of algorithms for in-memory analytical data processing.
   More detail about how implementations differ can be found on the
   :ref:`status` page.
   ```

##########
docs/source/format/Intro.rst:
##########
@@ -0,0 +1,511 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+************
+Introduction
+************
+
+Apache Arrow was born from the need for a set of standards around
+tabular data representation and interchange between systems.
+The adoption of these standards reduce computing costs of data
+serialization/deserialization and implementation costs across
+systems implemented in different programming languages.
+
+The Apache Arrow specification can be implemented in any programming
+language but official implementations for many languages are available.
+An implementation consists of format definitions using the constructs
+offered by the language and common in-memory data processing algorithms
+(e.g. slicing and concatenating). Users can extend and use the utilities
+provided by the Apache Arrow implementation in their programming
+language of choice. Some implementations are further ahead and feature a
+vast set of algorithms for in-memory analytical data processing.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving problems related to
+in-memory analytical data processing. This covers topics like:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+Apache Arrow focuses on tabular data. For an example, let's consider
+we have data that can be organized into a table:
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+   Diagram of a tabular data structure.
+
+Tabular data can be represented in memory using a row-based format or a
+column-based format. The row-based format stores data row-by-row, meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+   Tabular data being saved in memory row by row.
+
+In a columnar format, the data is organized column-by-column instead.
+This organization makes analytical operations like filtering, grouping,
+aggregations and others, more efficient thanks to memory locality.
+When processing the data, the memory locations accessed by the CPU tend
+be near one another. By keeping the data contiguous in memory, it also
+enables vectorization of the computations. Most modern
+CPUs have
+[SIMD instructions](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)
+(a single instruction that operates on multiple values at once) enabling parallel
+processing and execution of operations on vector data using a single CPU
+instruction.

Review Comment:
   Fixes this link:
   
   ```suggestion
   enables vectorization of the computations. Most modern CPUs have
   `SIMD instructions`_ (a single instruction that operates on multiple
   values at once) enabling parallel processing and execution of
   operations on vector data using a single CPU instruction.

   .. _SIMD instructions: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
   ```
