This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/master by this push:
new be4e5fc04c [MINOR] Improve formatting of json examples + links in nested parquet blogs (#256)
be4e5fc04c is described below
commit be4e5fc04cf908b4f3c1251b6bf84b47b13f6f10
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Oct 17 17:26:48 2022 -0400
[MINOR] Improve formatting of json examples + links in nested parquet blogs (#256)
* [MINOR]: Improve diagram markdown formatting
* Tweak markdown and add links
* Tweak
---
_posts/2022-10-05-arrow-parquet-encoding-part-1.md | 6 +--
_posts/2022-10-08-arrow-parquet-encoding-part-2.md | 62 +++++++++++-----------
2 files changed, 34 insertions(+), 34 deletions(-)
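
Note on the `ListArray` offsets passage touched in part 2: the post describes how consecutive pairs of offsets delimit each list's slice of the child array. Below is a minimal sketch of that slicing rule in plain Python. It is not the Arrow implementation; the function name `lists_from_offsets` and the string placeholders standing in for child values are hypothetical, and the validity (null) bitmap is ignored.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def lists_from_offsets(offsets: Sequence[int], child: Sequence[T]) -> List[List[T]]:
    """Rebuild list slots from a ListArray-style offsets buffer.

    Slot i covers child[offsets[i]:offsets[i + 1]], so N + 1 offsets
    describe N lists. Nulls are not modelled in this sketch.
    """
    slots: List[List[T]] = []
    # Walk consecutive (start, end) pairs of the offsets buffer.
    for start, end in zip(offsets, offsets[1:]):
        if start > end or end > len(child):
            raise ValueError(f"invalid offset pair ({start}, {end})")
        slots.append(list(child[start:end]))
    return slots


# Placeholder child values, named after their positions in the child array.
child = ["child[0]", "child[1]", "child[2]"]
slots = lists_from_offsets([0, 2, 3, 3], child)
for index, values in enumerate(slots):
    print(f"{index}: {values}")
```

With offsets `[0, 2, 3, 3]`, the pairs `(0,2)`, `(2,3)`, and `(3,3)` select two, one, and zero child values respectively, so the last slot is an empty (but non-null) list.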
diff --git a/_posts/2022-10-05-arrow-parquet-encoding-part-1.md b/_posts/2022-10-05-arrow-parquet-encoding-part-1.md
index 33d791ad92..a4688a7af8 100644
--- a/_posts/2022-10-05-arrow-parquet-encoding-part-1.md
+++ b/_posts/2022-10-05-arrow-parquet-encoding-part-1.md
@@ -44,7 +44,7 @@ First, it is necessary to take a step back and discuss the difference between co
For example
-```json
+```python
{"Column1": 1, "Column2": 2}
{"Column1": 3, "Column2": 4, "Column3": 5}
{"Column1": 5, "Column2": 4, "Column3": 5}
@@ -52,7 +52,7 @@ For example
In a columnar representation, the data for a given column is instead stored contiguously
-```text
+```python
Column1: [1, 3, 5]
Column2: [2, 4, 4]
Column3: [null, 5, 5]
@@ -147,4 +147,4 @@ Definition Values
## Next up: Nested and Hierarchical Data
-Armed with the foundational understanding of how Arrow and Parquet store nullability / definition differently we are ready to move on to more complex nested types, which you can read about in our upcoming blog post on the topic <!-- I propose to update this text with a link when when we have published the next blog -->.
+Armed with the foundational understanding of how Arrow and Parquet store nullability / definition differently we are ready to move on to more complex nested types, which you can read about in our [next blog post on the topic](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/).
diff --git a/_posts/2022-10-08-arrow-parquet-encoding-part-2.md b/_posts/2022-10-08-arrow-parquet-encoding-part-2.md
index f871c68530..c88b13eb25 100644
--- a/_posts/2022-10-08-arrow-parquet-encoding-part-2.md
+++ b/_posts/2022-10-08-arrow-parquet-encoding-part-2.md
@@ -37,25 +37,25 @@ Both Parquet and Arrow have the concept of a *struct* column, which is a column
For example, consider the following three JSON documents
-```json
-{ <-- First record
- "a": 1, <-- the top level fields are a, b, c, and d
- "b": { <-- b is always provided (not nullable)
- "b1": 1, <-- b1 and b2 are "nested" fields of "b"
- "b2": 3 <-- b2 is always provided (not nullable)
+```python
+{ # <-- First record
+ "a": 1, # <-- the top level fields are a, b, c, and d
+ "b": { # <-- b is always provided (not nullable)
+ "b1": 1, # <-- b1 and b2 are "nested" fields of "b"
+ "b2": 3 # <-- b2 is always provided (not nullable)
},
"d": {
- "d1": 1 <-- d1 is a "nested" field of "d"
+ "d1": 1 # <-- d1 is a "nested" field of "d"
}
}
```
-```json
-{ <-- Second record
+```python
+{ # <-- Second record
"a": 2,
"b": {
- "b2": 4 <-- note "b1" is NULL in this record
+ "b2": 4 # <-- note "b1" is NULL in this record
},
- "c": { <-- note "c" was NULL in the first record
+ "c": { # <-- note "c" was NULL in the first record
"c1": 6 but when "c" is provided, c1 is also
}, always provided (not nullable)
"d": {
@@ -64,8 +64,8 @@ For example, consider the following three JSON documents
}
}
```
-```json
-{ <-- Third record
+```python
+{ # <-- Third record
"b": {
"b1": 5,
"b2": 6
@@ -77,7 +77,7 @@ For example, consider the following three JSON documents
```
Documents of this format could be stored in an Arrow `StructArray` with this schema
-```text
+```python
Field(name: "a", nullable: true, datatype: Int32)
Field(name: "b", nullable: false, datatype: Struct[
Field(name: "b1", nullable: true, datatype: Int32),
@@ -144,14 +144,14 @@ For example consider the case of `d.d2`, which contains two nullable levels `d`
A definition level of `0` would imply a null at the level of `d`:
-```json
+```python
{
}
```
A definition level of `1` would imply a null at the level of `d`
-```json
+```python
{
"d": { null }
}
@@ -159,7 +159,7 @@ A definition level of `1` would imply a null at the level of `d`
A definition level of `2` would imply a defined value for `d.d2`:
-```json
+```python
{
"d": { "d2": .. }
}
@@ -168,7 +168,7 @@ A definition level of `2` would imply a defined value for `d.d2`:
Going back to the three JSON documents above, they could be stored in Parquet with this schema
-```text
+```python
message schema {
optional int32 a;
required group b {
@@ -230,29 +230,29 @@ The Parquet encoding of the example would be:
Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
-```json
-{ <-- First record
- "a": [1], <-- top-level field a containing list of integers
+```python
+{ # <-- First record
+ "a": [1], # <-- top-level field a containing list of integers
}
```
-```json
-{ <-- "a" is not provided (is null)
+```python
+{ # <-- "a" is not provided (is null)
}
```
-```json
-{ <-- "a" is non-null but empty
+```python
+{ # <-- "a" is non-null but empty
"a": []
}
```
-```json
+```python
{
- "a": [null, 2], <-- "a" has a null and non-null elements
+ "a": [null, 2], # <-- "a" has a null and non-null elements
}
```
Documents of this format could be stored in this Arrow schema
-```text
+```python
Field(name: "a", nullable: true, datatype: List(
Field(name: "element", nullable: true, datatype: Int32),
)
@@ -262,7 +262,7 @@ As before, Arrow chooses to represent this in a hierarchical fashion as a `ListA
For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
-```text
+```python
0: [child[0], child[1]]
1: []
2: [child[2]]
@@ -299,7 +299,7 @@ More technical detail is available in the [ListArray format specification](https
The example above with 4 JSON documents can be stored in this Parquet schema
-```text
+```python
message schema {
optional group a (LIST) {
repeated group list {
@@ -343,6 +343,6 @@ The example above would therefore be encoded as
## Next up: Arbitrary Nesting: Lists of Structs and Structs of Lists
-In our final blog post <!-- When published, add link here --> we will explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures.
+In our [final blog post](https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/), we explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures.
If you want to store and process structured types, you will be pleased to hear that the Rust [parquet](https://crates.io/crates/parquet) implementation fully supports reading and writing directly into Arrow, as simply as any other type. All the complex record shredding and reconstruction is handled automatically. With this and other exciting features such as [reading asynchronously](https://docs.rs/parquet/22.0.0/parquet/arrow/async_reader/index.html) from [object storage](https://docs. [...]