tustvold commented on code in PR #247:
URL: https://github.com/apache/arrow-site/pull/247#discussion_r996493330


##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+---
+layout: post
+title: "Arrow and Parquet Part 3: Arbitrary Nesting with Lists of Structs and 
Structs of Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the third of a three-part series exploring how projects such as [Rust 
Apache Arrow](https://github.com/apache/arrow-rs) support conversion between 
[Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache 
Parquet](https://parquet.apache.org/) for efficient storage. [Apache 
Arrow](https://arrow.apache.org/) is an open, language-independent columnar 
memory format for flat and hierarchical data, organized for efficient analytic 
operations. [Apache Parquet](https://parquet.apache.org/) is an open, 
column-oriented data file format designed for very efficient data encoding and 
retrieval.
+
+
+[Arrow and Parquet Part 1: Primitive Types and 
Nullability](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/)
 covers the basics of primitive types.  [Arrow and Parquet Part 2: Nested and 
Hierarchical Data using Structs and 
Lists](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/) 
covers the `Struct` and `List` types,  and now this post gives an example of 
how both formats combine the topics to support arbitrary nesting.

Review Comment:
   ```suggestion
   [Arrow and Parquet Part 1: Primitive Types and 
Nullability](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/)
 covered the basics of primitive types.  [Arrow and Parquet Part 2: Nested and 
Hierarchical Data using Structs and 
Lists](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/) 
covered the `Struct` and `List` types. This post builds on this foundation to 
show how both formats combine these to support arbitrary nesting.
   ```



##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+
+Some libraries, such as the Rust [parquet](https://crates.io/crates/parquet) 
implementation, offer complete support for such combinations, and users of 
those libraries do not need to worry about these details except to satisfy 
their own curiosity. Other libraries may not handle some corner cases, and this 
post gives some flavor of why doing so is complicated.
+
+
+# Structs with Lists
+Consider the following three JSON documents:
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+  "b": [              <-- top-level field b containing list of structures
+    {                 <-- list element of b containing two fields b1 and b2
+      "b1": 1         <-- b1 is always provided (non nullable)
+    },
+    {
+      "b1": 1,
+      "b2": [         <-- b2 contains list of integers
+        3, 4          <-- list elements of b.b2 always provided (non nullable)
+      ]
+    }
+  ]
+}
+```
+
+```json
+{
+  "b": [              <-- b is always provided (non nullable)
+    {
+      "b1": 2
+    }
+  ]
+}
+```
+
+```json
+{
+  "a": [null, null],  <-- list elements of a are nullable
+  "b": [null]         <-- list elements of b are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema:
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+Field(name: "b", nullable: false, datatype: List(
+  Field(name: "element", nullable: true, datatype: Struct[
+    Field(name: "b1", nullable: false, datatype: Int32),
+    Field(name: "b2", nullable: true, datatype: List(
+      Field(name: "element", nullable: false, datatype: Int32)
+    ))
+  ])
+))
+```
+
+
+As explained previously, Arrow chooses to represent this in a hierarchical 
fashion.  `StructArray`s are stored as child arrays that contain each field of 
the struct.  `ListArray`s are stored as lists of monotonically increasing 
integers called offsets, and values are stored in a single child array. Each 
consecutive pair of elements in the offset array identifies a slice of the 
child array for that array index.

Review Comment:
   ```suggestion
   As explained previously, Arrow chooses to represent this in a hierarchical 
fashion. `StructArray`s are stored as child arrays that contain each field of 
the struct.  `ListArray`s are stored as lists of monotonically increasing 
integers called offsets, with values stored in a single child array. Each 
consecutive pair of elements in the offset array identifies a slice of the 
child array for that array index.
   ```
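
To make the offsets description concrete, here is a minimal sketch in plain Python that rebuilds column `a` from the validity, offsets, and child values shown in the post's Arrow diagram. It is illustrative only and does not use the arrow-rs API; the helper name `decode_list_array` is hypothetical.

```python
# Buffers for column "a", copied from the Arrow diagram in the post.
offsets = [0, 1, 1, 3]                # one more entry than there are rows
list_validity = [True, False, True]   # the second record has no "a"
# Child values; None stands for a null child slot (shown as ?? in the diagram).
child_values = [1, None, None]


def decode_list_array(offsets, validity, child):
    """Rebuild one Python list per row from offsets, validity and child values."""
    rows = []
    for i, valid in enumerate(validity):
        if not valid:
            rows.append(None)  # null list: the offsets pair is ignored
            continue
        # Consecutive offsets delimit this row's slice of the child array.
        start, end = offsets[i], offsets[i + 1]
        rows.append(child[start:end])
    return rows


decoded = decode_list_array(offsets, list_validity, child_values)
print(decoded)  # [[1], None, [None, None]]
```

The three decoded rows correspond to the three JSON documents: `[1]`, a missing `a`, and `[null, null]`.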



##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+
+The Arrow encoding of the example would be:
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+                     ┌──────────────────┐
+│ ┌─────┐   ┌─────┐  │ ┌─────┐   ┌─────┐│ │
+  │  1  │   │  0  │  │ │  1  │   │  1  ││
+│ ├─────┤   ├─────┤  │ ├─────┤   ├─────┤│ │
+  │  0  │   │  1  │  │ │  0  │   │ ??  ││
+│ ├─────┤   ├─────┤  │ ├─────┤   ├─────┤│ │
+  │  1  │   │  1  │  │ │  0  │   │ ??  ││
+│ └─────┘   ├─────┤  │ └─────┘   └─────┘│ │
+            │  3  │  │ Validity   Values│
+│ Validity  └─────┘  │                  │ │
+                     │ child[0]         │
+│ "a"       Offsets  │ PrimitiveArray   │ │
+  ListArray          └──────────────────┘
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+           ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │
+│                    ┌──────────┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌─────┐  │ ┌─────┐ │ ┌─────┐  │   ┌─────┐ ┌─────┐ ┌──────────┐ │ │ │
+│ │  0  │    │  1  │ │ │  1  │  │ │ │  0  │ │  0  │ │ ┌─────┐  │
+  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │ │  3  │  │ │ │ │
+│ │  2  │    │  1  │ │ │  1  │  │ │ │  1  │ │  0  │ │ ├─────┤  │
+  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │ │  4  │  │ │ │ │
+│ │  3  │    │  1  │ │ │  2  │  │ │ │  0  │ │  2  │ │ └─────┘  │
+  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │          │ │ │ │
+│ │  4  │    │  0  │ │ │ ??  │  │ │ │ ??  │ │  2  │ │  Values  │
+  └─────┘  │ └─────┘ │ └─────┘  │   └─────┘ ├─────┤ │          │ │ │ │
+│                    │          │ │         │  2  │ │          │
+  Offsets  │ Validity│ Values   │           └─────┘ │          │ │ │ │
+│                    │          │ │Validity         │ child[0] │
+           │         │ "b1"     │           Offsets │ Primitive│ │ │ │
+│                    │ Primitive│ │ "b2"            │ Array    │
+           │         │ Array    │   ListArray       └──────────┘ │ │ │
+│                    └──────────┘ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+           │ "element"                                             │ │
+│            StructArray
+  "b"      └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ │
+│ ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+  required group b (LIST) {
+    repeated group list {
+      optional group element {
+        required int32 b1;
+        optional group b2 (LIST) {
+          repeated group list {
+            required int32 element;
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+As explained in our previous posts, Parquet uses repetition levels and 
definition levels to encode nested structures and nullability.
+
+Definition and repetition levels are a non-trivial topic. For more detail, you 
can read the [Google Dremel Paper](https://research.google/pubs/pub36632/) 
which is typically cited as the inspiration for Parquet repetition and 
definition levels, and offers an academic description of the algorithm. You can 
also explore this 
[gist](https://gist.github.com/alamb/acd653c49e318ff70672b61325ba3443) to see 
Rust [parquet](https://crates.io/crates/parquet) code which generates the 
example below.

Review Comment:
   ```suggestion
   Definition and repetition levels are a non-trivial topic. For more detail, 
you can read the [Google Dremel Paper](https://research.google/pubs/pub36632/) 
which offers an academic description of the algorithm. You can also explore 
this [gist](https://gist.github.com/alamb/acd653c49e318ff70672b61325ba3443) to 
see Rust [parquet](https://crates.io/crates/parquet) code which generates the 
example below.
   ```
   
   Parquet explicitly uses Dremel encoding, so "typically cited as the inspiration" understates the relationship.
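
As a rough illustration of how the levels follow from the schema, the sketch below computes the maximum definition and repetition level of each leaf column in the Parquet schema quoted above. It assumes the usual Dremel rule: every `optional` or `repeated` field on the path adds one definition level, and every `repeated` field adds one repetition level. The `COLUMNS` table and `max_levels` helper are hypothetical names written for this comment, not part of the parquet crate.

```python
# Path from the schema root to each leaf column, as (field name, repetition)
# pairs transcribed by hand from the Parquet schema in the post.
COLUMNS = {
    "a.list.element": [
        ("a", "optional"), ("list", "repeated"), ("element", "optional"),
    ],
    "b.list.element.b1": [
        ("b", "required"), ("list", "repeated"), ("element", "optional"),
        ("b1", "required"),
    ],
    "b.list.element.b2.list.element": [
        ("b", "required"), ("list", "repeated"), ("element", "optional"),
        ("b2", "optional"), ("list", "repeated"), ("element", "required"),
    ],
}


def max_levels(path):
    """Return (max definition level, max repetition level) for a leaf path."""
    # Required fields can never be absent, so they add no definition level.
    max_def = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    # Only repeated fields start new lists, so only they add repetition levels.
    max_rep = sum(1 for _, rep in path if rep == "repeated")
    return max_def, max_rep


levels = {name: max_levels(path) for name, path in COLUMNS.items()}
for name, (max_def, max_rep) in levels.items():
    print(f"{name}: max definition level {max_def}, max repetition level {max_rep}")
```

Under that rule, `a` needs definition levels up to 3, `b1` up to 2, and `b2` up to 4 with two repetition levels, which is what makes the doubly nested `b2` column the hardest one to encode.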


