AlenkaF commented on code in PR #41593: URL: https://github.com/apache/arrow/pull/41593#discussion_r1632668766
########## docs/source/format/Intro.rst: ########## @@ -0,0 +1,454 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +***************************************** +Introduction to the Arrow Columnar Format +***************************************** + +Apache Arrow was born with the idea to define a set of standards for +data representation and interchange between languages and systems to +avoid costs of data serialization/deserialization and in order to +avoid reinventing the wheel in each of those systems and languages. + +Each system or language requires their own format definitions, implementation +of common algorithms, etcetera. In our heterogeneous environments we +often have to move data from one system or language to accommodate our +workflows that meant copy and convert the data between them, which is +quite costly. + +Apart from this initial vision, Arrow has grown to also develop a +multi-language collection of libraries for solving problems related to +in-memory analytical data processing. This includes such topics as: + +* Zero-copy shared memory and RPC-based data movement +* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_) +* In-memory analytics and query processing + +.. _Apache ORC: https://orc.apache.org/ +.. _Apache Parquet: https://parquet.apache.org/ + +Arrow Columnar Format +===================== + +.. figure:: ./images/columnar-diagram_1.svg + :scale: 70% + :alt: Diagram with tabular data of 4 rows and columns. + +Data can be represented in memory using a row based format or a column +based format. The row based format stores data by row meaning the rows +are adjacent in the computer memory: + +.. figure:: ./images/columnar-diagram_2.svg + :alt: Tabular data being structured row by row in computer memory. + +In a columnar format, on the other hand, the data is organised by column +instead of by row making analytical operations like filtering, grouping, +aggregations and others much more efficient. CPU can maintain memory locality +and require less memory jumps to process the data. By keeping the data contiguous +in memory it also enables vectorization of the computations. Most modern +CPUs have single instructions, multiple data (SIMD) enabling parallel +processing and execution of instructions on vector data in single CPU +instructions. + +.. figure:: ./images/columnar-diagram_3.svg + :alt: Tabular data being structured column by column in computer memory. + +The column is called an ``Array`` in Arrow terminology. Arrays can be of +different types and the way their values are stored in memory varies between +types. The specification of how these values are arranged in memory is what we +call a ``physical memory layout``. One contiguous region of memory that stores +data for arrays is called a ``Buffer``. + + +Support for null values +======================= + +Arrow supports missing values or "nulls" for all data types: any value +in an array may be semantically null, whether primitive or nested type. + +In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, +is used alongside the data indicating whether each value in the array is +null or not. You can think of it as vector of 0 and 1 values, where a 1 +means that the value is not-null ("valid"), while a 0 indicates the value +is null. + +This validity bitmap is optional, i.e. if there are no missing values in +the array the buffer does not need to be allocated (as in the example +column 1 in the diagram below). + +Primitive layouts +================= + +Fixed Size Primitive Layout +--------------------------- + +A primitive column represents an array of values where each value +has the same physical size measured in bytes. Data types that share the +same fixed size primitive layout are for example signed and unsigned +integer types, floating point numbers, boolean, decimal and temporal +types. + +.. figure:: ./images/primitive-diagram.svg + :alt: Diagram is showing the difference between the primitive data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for primitive data types. + +.. note:: + Boolean data type is represented with a primitive layout where the + values are encoded in bits instead of bytes. That means the physical + layout includes a values bitmap buffer and possibly a validity bitmap + buffer. + + .. figure:: ./images/bool-diagram.svg + :alt: Diagram is showing the difference between the boolean data + type presented in a Table and the data actually stored in + computer memory. + + Physical layout diagram for boolean data type. + +.. note:: + Arrow also has a concept of Null type where all values are null. In + this case no memory buffers are allocated. + +Variable length binary and string +--------------------------------- + +The bytes of a binary or string column are stored together consecutively +in a single buffer or region of memory. To know where each element of the +column starts and ends the physical layout also includes integer offsets. +The length of the offset buffer is one more than the length of the values +buffer as the last two elements define the start and the end of the last +element in the binary/string column. + +Binary and string types share the same physical layout. The one difference +between them is that the string type is utf-8 binary and will produce an +invalid result if the bytes are not valid utf-8. + +The difference between binary/string and large binary/string is in the offset +type. In the first case that is int32 and in the second it is int64. + +The limitation of types using 32 bit offsets is that they have a max size of +2GB per array. One can still use the non-large variants for bigger data, but +then multiple chunks are needed. + +.. figure:: ./images/var-string-diagram.svg + :alt: Diagram is showing the difference between the variable length + string data type presented in a Table and the data actually + stored in computer memory. + + Physical layout diagram for variable length string data types. + +Variable length binary and string view +-------------------------------------- + +This layout is an alternative for the variable length binary layout and is adapted from TU Munich's `UmbraDB`_ and is similar to the string +layout used in `DuckDB`_ and `Velox`_ (and sometimes also called "German style strings"). + +.. _UmbraDB: https://umbra-db.com/ +.. _DuckDB: https://duckdb.com +.. _Velox: https://velox-lib.io/ +The main differences to classical binary and string layout is the views buffer. +It includes the length of the string, and then either contains the characters +inline (for small strings) or only the first 4 bytes of the string and point to a location in one of +potentially several data buffers. It also supports binary and strings to be written +out of order. + +These properties are important for efficient string processing. The prefix +enables a profitable fast path for string comparisons, which are frequently +determined within the first four bytes. Selecting elements is a simple "take" +operation on the fixed-width views buffer and does not need to rewrite the +values buffers. + +.. figure:: ./images/var-string-view-diagram.svg + :alt: Diagram is showing the difference between the variable length + string view data type presented in a Table and the dataactually + stored in computer memory. + + Physical layout diagram for variable length string view data type. + +Nested layouts +============== + +Nested types introduce the concept of parent and child arrays. They express +relationships between physical value arrays in a nested type structure. + +Nested types depend on one or more other child data types. For instance, List +is a nested type (parent) that has one child (the data types of the values in +the list). + +List +---- + +The list type enables values of the same type being stacked together in a +sequence of values in each column slot. The layout is similar to binary or Review Comment: Will use "The list type enables representing an array where each element is a sequence of elements of the same type". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
