tustvold commented on code in PR #247:
URL: https://github.com/apache/arrow-site/pull/247#discussion_r996493330
##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+---
+layout: post
+title: "Arrow and Parquet Part 3: Arbitrary Nesting with Lists of Structs and
Structs of Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the third of a three-part series exploring how projects such as [Rust
Apache Arrow](https://github.com/apache/arrow-rs) support conversion between
[Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache
Parquet](https://parquet.apache.org/) for efficient storage. [Apache
Arrow](https://arrow.apache.org/) is an open, language-independent columnar
memory format for flat and hierarchical data, organized for efficient analytic
operations. [Apache Parquet](https://parquet.apache.org/) is an open,
column-oriented data file format designed for very efficient data encoding and
retrieval.
+
+
+[Arrow and Parquet Part 1: Primitive Types and
Nullability](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/)
covers the basics of primitive types. [Arrow and Parquet Part 2: Nested and
Hierarchical Data using Structs and
Lists](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/)
covers the `Struct` and `List` types, and now this post gives an example of
how both formats combine the topics to support arbitrary nesting.
Review Comment:
```suggestion
[Arrow and Parquet Part 1: Primitive Types and
Nullability](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/)
covered the basics of primitive types. [Arrow and Parquet Part 2: Nested and
Hierarchical Data using Structs and
Lists](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/)
covered the `Struct` and `List` types. This post builds on this foundation to
show how both formats combine these to support arbitrary nesting.
```
##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+Some libraries, such as the Rust [parquet](https://crates.io/crates/parquet)
implementation, offer complete support for such combinations, and users of
those libraries do not need to worry about these details except to satisfy
their own curiosity. Other libraries may not handle some corner cases, and this
post gives some flavor of why doing so is complicated.
+
+
+# Structs with Lists
+Consider the following three JSON documents:
+
+```json
+{ <-- First record
+  "a": [1], <-- top-level field a containing list of integers
+  "b": [ <-- top-level field b containing list of structures
+    { <-- list element of b containing two fields b1 and b2
+      "b1": 1 <-- b1 is always provided (non nullable)
+    },
+    {
+      "b1": 1,
+      "b2": [ <-- b2 contains list of integers
+        3, 4 <-- list elements of b.b2 always provided (non nullable)
+      ]
+    }
+  ]
+}
+```
+
+```json
+{
+  "b": [ <-- b is always provided (non nullable)
+    {
+      "b1": 2
+    }
+  ]
+}
+```
+
+```json
+{
+  "a": [null, null], <-- list elements of a are nullable
+  "b": [null] <-- list elements of b are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema:
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+Field(name: "b", nullable: false, datatype: List(
+  Field(name: "element", nullable: true, datatype: Struct[
+    Field(name: "b1", nullable: false, datatype: Int32),
+    Field(name: "b2", nullable: true, datatype: List(
+      Field(name: "element", nullable: false, datatype: Int32)
+    ))
+  ])
+))
+```
+
+
+As explained previously, Arrow chooses to represent this in a hierarchical
fashion. `StructArray`s are stored as child arrays that contain each field of
the struct. `ListArray`s are stored as lists of monotonically increasing
integers called offsets, and values are stored in a single child array. Each
consecutive pair of elements in the offset array identifies a slice of the
child array for that array index.
Review Comment:
```suggestion
As explained previously, Arrow chooses to represent this in a hierarchical
fashion. `StructArray`s are stored as child arrays that contain each field of
the struct. `ListArray`s are stored as lists of monotonically increasing
integers called offsets, with values stored in a single child array. Each
consecutive pair of elements in the offset array identifies a slice of the
child array for that array index.
```
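
For readers following the offsets description, here is a minimal sketch in plain Rust (standard library only, not the arrow-rs API) of how each consecutive pair of offsets selects a slice of the child array. The function name `list_slice` and the use of `Option<i32>` in place of a child validity bitmap are assumptions made for illustration; the data is column `a` from the example.

```rust
use std::ops::Range;

/// Return the child-array range for list slot `i`, or `None` if that slot is null.
/// `offsets` has one more entry than `validity`; this helper is hypothetical.
fn list_slice(offsets: &[i32], validity: &[bool], i: usize) -> Option<Range<usize>> {
    if !validity[i] {
        return None;
    }
    // Consecutive offsets bound the slice of the child array for slot `i`.
    Some(offsets[i] as usize..offsets[i + 1] as usize)
}

fn main() {
    // Column "a" from the example: [1], null, [null, null]
    let offsets = [0, 1, 1, 3];
    let validity = [true, false, true];
    let child: [Option<i32>; 3] = [Some(1), None, None];

    for i in 0..validity.len() {
        match list_slice(&offsets, &validity, i) {
            Some(r) => println!("row {}: {:?}", i, &child[r]),
            None => println!("row {}: null", i),
        }
    }
}
```

Note that the null slot still occupies an entry in the offsets array; it simply repeats the previous offset, so the slice is empty and the validity mask is what distinguishes it from an empty list.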
##########
_posts/2022-10-17-arrow-parquet-encoding-part-3.md:
##########
@@ -0,0 +1,221 @@
+The Arrow encoding of the example would be:
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+ ┌──────────────────┐
+│ ┌─────┐ ┌─────┐ │ ┌─────┐ ┌─────┐│ │
+ │ 1 │ │ 0 │ │ │ 1 │ │ 1 ││
+│ ├─────┤ ├─────┤ │ ├─────┤ ├─────┤│ │
+ │ 0 │ │ 1 │ │ │ 0 │ │ ?? ││
+│ ├─────┤ ├─────┤ │ ├─────┤ ├─────┤│ │
+ │ 1 │ │ 1 │ │ │ 0 │ │ ?? ││
+│ └─────┘ ├─────┤ │ └─────┘ └─────┘│ │
+ │ 3 │ │ Validity Values│
+│ Validity └─────┘ │ │ │
+ │ child[0] │
+│ "a" Offsets │ PrimitiveArray │ │
+ ListArray └──────────────────┘
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │
+│ ┌──────────┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ┌─────┐ │ ┌─────┐ │ ┌─────┐ │ ┌─────┐ ┌─────┐ ┌──────────┐ │ │ │
+│ │ 0 │ │ 1 │ │ │ 1 │ │ │ │ 0 │ │ 0 │ │ ┌─────┐ │
+ ├─────┤ │ ├─────┤ │ ├─────┤ │ ├─────┤ ├─────┤ │ │ 3 │ │ │ │ │
+│ │ 2 │ │ 1 │ │ │ 1 │ │ │ │ 1 │ │ 0 │ │ ├─────┤ │
+ ├─────┤ │ ├─────┤ │ ├─────┤ │ ├─────┤ ├─────┤ │ │ 4 │ │ │ │ │
+│ │ 3 │ │ 1 │ │ │ 2 │ │ │ │ 0 │ │ 2 │ │ └─────┘ │
+ ├─────┤ │ ├─────┤ │ ├─────┤ │ ├─────┤ ├─────┤ │ │ │ │ │
+│ │ 4 │ │ 0 │ │ │ ?? │ │ │ │ ?? │ │ 2 │ │ Values │
+ └─────┘ │ └─────┘ │ └─────┘ │ └─────┘ ├─────┤ │ │ │ │ │
+│ │ │ │ │ 2 │ │ │
+ Offsets │ Validity│ Values │ └─────┘ │ │ │ │ │
+│ │ │ │Validity │ child[0] │
+ │ │ "b1" │ Offsets │ Primitive│ │ │ │
+│ │ Primitive│ │ "b2" │ Array │
+ │ │ Array │ ListArray └──────────┘ │ │ │
+│ └──────────┘ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │ "element" │ │
+│ StructArray
+ "b" └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ │
+│ ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+Documents of this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+  required group b (LIST) {
+    repeated group list {
+      optional group element {
+        required int32 b1;
+        optional group b2 (LIST) {
+          repeated group list {
+            required int32 element;
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+As explained in our previous posts, Parquet uses repetition levels and
definition levels to encode nested structures and nullability.
+
+Definition and repetition levels are a non-trivial topic. For more detail, you
can read the [Google Dremel Paper](https://research.google/pubs/pub36632/)
which is typically cited as the inspiration for Parquet repetition and
definition levels, and offers an academic description of the algorithm. You can
also explore this
[gist](https://gist.github.com/alamb/acd653c49e318ff70672b61325ba3443) to see
Rust [parquet](https://crates.io/crates/parquet) code which generates the
example below.
Review Comment:
```suggestion
Definition and repetition levels are a non-trivial topic. For more detail,
you can read the [Google Dremel Paper](https://research.google/pubs/pub36632/)
which offers an academic description of the algorithm. You can also explore
this [gist](https://gist.github.com/alamb/acd653c49e318ff70672b61325ba3443) to
see Rust [parquet](https://crates.io/crates/parquet) code which generates the
example below.
```
Parquet explicitly uses Dremel encoding, so there is no need to hedge with "typically cited as the inspiration".
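
To make the levels concrete, here is a minimal sketch in plain Rust (standard library only, not the parquet crate API) that derives definition and repetition levels for column `a` of the example. It assumes the three-level `LIST` layout in the schema above, which gives a maximum definition level of 3 and a maximum repetition level of 1. The `Row` alias and `encode_levels` function are hypothetical names used only for illustration.

```rust
/// One record's value for column `a`: a nullable list of nullable integers.
type Row = Option<Vec<Option<i32>>>;

/// Compute (definition levels, repetition levels, values) for column `a`.
/// Definition levels: 0 = `a` is null, 1 = `a` is an empty list,
/// 2 = list element is null, 3 = list element is present.
fn encode_levels(rows: &[Row]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
    let mut def: Vec<i16> = Vec::new();
    let mut rep: Vec<i16> = Vec::new();
    let mut values: Vec<i32> = Vec::new();
    for row in rows {
        match row {
            None => {
                def.push(0);
                rep.push(0);
            }
            Some(items) if items.is_empty() => {
                def.push(1);
                rep.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // 0 starts a new record; 1 continues the current list.
                    rep.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(v) => {
                            def.push(3);
                            values.push(*v);
                        }
                        None => def.push(2),
                    }
                }
            }
        }
    }
    (def, rep, values)
}

fn main() {
    // Column "a" across the three example records: [1], null, [null, null]
    let rows: Vec<Row> = vec![Some(vec![Some(1)]), None, Some(vec![None, None])];
    let (def, rep, values) = encode_levels(&rows);
    println!("def = {:?}, rep = {:?}, values = {:?}", def, rep, values);
}
```

Only non-null leaf values are written to the values buffer; nulls and empty lists are recoverable from the levels alone.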