Re: [PR] GH-41673: [Format][Docs] Add arrow format introductory page [arrow]

via GitHub Thu, 16 May 2024 03:20:29 -0700


jorisvandenbossche commented on code in PR #41593:
URL: https://github.com/apache/arrow/pull/41593#discussion_r1603050353



##########
docs/source/format/FormatIntro.rst:
##########
@@ -0,0 +1,519 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving systems problems
+related to in-memory analytical data processing. This includes such
+topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache 
Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others much more efficient. CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data 
contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of instructions on vector data in single CPU
+instructions.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+Compression is another element where columnar format representation can
+take high advantage. Data similarity allows for better compression
+techniques and algorithms. Having the same data types locality allows
+to have better compression ratios.
+
+Primitive layouts

Review Comment:
   I think before starting here with explaining the primitive layout, we should 
explain what we mean with a "layout"?



##########
docs/source/format/FormatIntro.rst:
##########
@@ -0,0 +1,519 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving systems problems

Review Comment:
   I am not familiar enough with this exact terminology, but would remove the 
"systems" part (not sure why this would be specifically "system problems")



##########
docs/source/format/FormatIntro.rst:
##########
@@ -0,0 +1,519 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving systems problems
+related to in-memory analytical data processing. This includes such
+topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache 
Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others much more efficient. CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data 
contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of instructions on vector data in single CPU
+instructions.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+Compression is another element where columnar format representation can
+take high advantage. Data similarity allows for better compression
+techniques and algorithms. Having the same data types locality allows
+to have better compression ratios.

Review Comment:
   Should we mention this explicitly, given that we don't actually have 
compression capabilities in the in-memory format itself? 
   (we do have it in the IPC spec, though)



##########
docs/source/format/FormatIntro.rst:
##########
@@ -0,0 +1,519 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving systems problems
+related to in-memory analytical data processing. This includes such
+topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache 
Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others much more efficient. CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data 
contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of instructions on vector data in single CPU
+instructions.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+Compression is another element where columnar format representation can
+take high advantage. Data similarity allows for better compression
+techniques and algorithms. Having the same data types locality allows
+to have better compression ratios.
+
+Primitive layouts
+=================
+
+Fixed Size Primitive Layout
+---------------------------
+
+A primitive column represents an array of values where each value
+has the same physical size measured in bytes. Data types that share the
+same fixed size primitive layout are for example signed and unsigned
+integer types, floating point numbers, boolean, decimal and temporal
+types.
+
+Support for null values
+-----------------------

Review Comment:
   I don't have a direct suggestion of how to solve it, but because "Support 
for null values" header is nested under the "Primitive layouts"  header, that 
might give the wrong impression that this is only for primitive layouts?



##########
docs/source/format/FormatIntro.rst:
##########


Review Comment:
   Small nitpick, but I would just call this "Intro.rst" (we could later add a 
brief into to other parts of the spec as well, like a brief explainer of what 
IPC is), that will also look better in the resulting url



##########
docs/source/format/FormatIntro.rst:
##########
@@ -0,0 +1,519 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+*****************************************
+Introduction to the Arrow Columnar Format
+*****************************************
+
+Apache Arrow was born with the idea to define a set of standards for
+data representation and interchange between languages and systems to
+avoid costs of data serialization/deserialization and in order to
+avoid reinventing the wheel in each of those systems and languages.
+
+Each system or language requires their own format definitions, implementation
+of common algorithms, etcetera. In our heterogeneous environments we
+often have to move data from one system or language to accommodate our
+workflows that meant copy and convert the data between them, which is
+quite costly.
+
+Apart from this initial vision, Arrow has grown to also develop a
+multi-language collection of libraries for solving systems problems
+related to in-memory analytical data processing. This includes such
+topics as:
+
+* Zero-copy shared memory and RPC-based data movement
+* Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache 
Parquet`_)
+* In-memory analytics and query processing
+
+.. _Apache ORC: https://orc.apache.org/
+.. _Apache Parquet: https://parquet.apache.org/
+
+Arrow Columnar Format
+=====================
+
+.. figure:: ./images/columnar-diagram_1.svg
+   :scale: 70%
+   :alt: Diagram with tabular data of 4 rows and columns.
+
+Data can be represented in memory using a row based format or a column
+based format. The row based format stores data by row meaning the rows
+are adjacent in the computer memory:
+
+.. figure:: ./images/columnar-diagram_2.svg
+   :alt: Tabular data being structured row by row in computer memory.
+
+In a columnar format, on the other hand, the data is organised by column
+instead of by row making analytical operations like filtering, grouping,
+aggregations and others much more efficient. CPU can maintain memory locality
+and require less memory jumps to process the data. By keeping the data 
contiguous
+in memory it also enables vectorization of the computations. Most modern
+CPUs have single instructions, multiple data (SIMD) enabling parallel
+processing and execution of instructions on vector data in single CPU
+instructions.
+
+.. figure:: ./images/columnar-diagram_3.svg
+   :alt: Tabular data being structured column by column in computer memory.
+
+Compression is another element where columnar format representation can
+take high advantage. Data similarity allows for better compression
+techniques and algorithms. Having the same data types locality allows
+to have better compression ratios.
+
+Primitive layouts

Review Comment:
   And reading a bit further, other terms are used without them really being 
introduced, like "buffer" (eg, we could first have a generic explanation of an 
Array of a certain type having a certain physical layout, which describes the 
buffers)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-41673: [Format][Docs] Add arrow format introductory page [arrow]

Reply via email to