Re: [PR] [BLOG] Variant Blog [parquet-site]

via GitHub Thu, 19 Feb 2026 04:50:17 -0800


alamb commented on code in PR #171:
URL: https://github.com/apache/parquet-site/pull/171#discussion_r2827567679



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents 
with different structures to coexist in the same column. This enables seamless 
schema evolution while maintaining full queryability across all schema 
variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
 in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as 
a struct with two binary fields. The encoding is 
[designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 so engines can efficiently navigate nested structures and extract only the 
fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for 
example, field-name dictionaries for objects). This avoids repeatedly storing 
the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting 
primitive values as well as arrays and objects.
+
+#### Example
+
+A web access event can be stored in a single Variant column while preserving 
the original data types:
+
+```json
+{
+  "userId": 12345,
+  "events": [
+    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
+    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
+  ]
+}
+```
+
+Compared with storing the same payload as a JSON string, Variant retains type 
information (for example, timestamp values are stored as integers rather than 
being stored as strings), which improves correctness, enables more efficient 
querying and requires fewer bytes to store.
+
+Just as importantly, Variant supports **schema variability**: records with 
different shapes can coexist in the same column without requiring schema 
migrations. For example, the following record can be stored alongside the event 
record above:
+
+```json
+{
+  "userId": 12345,
+  "error": "auth_failure" 
+}
+```
+
+---
+
+## Shredding Encoding
+
+To enhance query performance and storage efficiency, Variant data can be 
**shredded** by extracting frequently accessed fields into separate, 
strongly-typed columns, as described in the [detailed shredding 
specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
 For each shredded field:
+
+- If the field **matches the expected schema**, its value is written to the 
strongly typed field.
+- If the field **does not match**, the original representation is written as 
Variant-encoded binary field and the corresponding strongly typed field is left 
NULL.
+
+![Shredding Variant Visualization](/blog/variant/variant_shredding.png)
+
+The query engine decides which fields to shred based on access patterns and 
workload characteristics. Once shredded, the standard Parquet columnar 
optimizations (encoding, compression, statistics) are used for the typed 
columns.
+
+### Implementation Considerations
+
+- **Schema Inference:** Engines can infer the shredding schema from sample 
data by selecting the most frequently occurring type for each field. For 
example, if `event.id` is predominantly an integer, the engine shreds it to an 
INT64 column.
+
+- **Type Promotion:** To maximize shredding coverage, engines can promote 
types within the same type family. For example, if integer values vary in size 
(INT8, INT32, INT64), selecting INT64 as the shredded type ensures all integer 
values can be shredded rather than falling back to the unshredded 
representation.
+
+- **Metadata Control:** To control metadata overhead, engines may limit the 
number of shredded fields, since each field contributes statistics (min/max 
values, null counts) to the file footer and column stats.
+
+- **Explicit Shredding Schema:** When read patterns are known in advance, 
engines can specify an explicit shredding schema at write time, ensuring that 
frequently accessed fields are shredded for optimal query performance.
+
+### Performance Characteristics
+
+- **Selective field access:** When queries access only the shredded fields, 
only those columns are read from Parquet, skipping the rest, benefiting from 
column pruning and predicate pushdown.
+
+- **Full Variant reconstruction:** When queries require access to the complete 
Variant object, there is a performance overhead as the engine must reconstruct 
the Variant by merging data from the shredded typed fields and the base Variant 
column.
+
+### Examples of Shredded Parquet Schemas
+
+The following example shows shredding non nested Variants. In this case, the 
writer chose to shred String values as the `typed_value` column.  Rows which do 
not contain strings are stored in the `value` column, with the binary variant 
encoding.
+
+```parquet
+optional group SIMPLE_DATA (VARIANT(1)) = 1 { 
+    required binary metadata;           # variant metadata
+    optional binary value;              # non-shredded value   
+    optional binary typed_value (STRING); # the shredded value 
+}
+```
+
+The series of variant values “Jim”, 100,  {“name”: “Jim”} are encoded as:
+
+| Variant Value | `value` | `typed_value` |
+|---------------|---------|---------------|
+| `"Jim"` | `null` | `"Jim"` |
+| `100` | `100` | `null` |
+| `{"name": "Jim"}` | `{"name": "Jim"}` | `null` |
+
+---
+
+Shredding nested variants is similar, with the shredding applied recursively, 
as shown in the following example. In this case, the `userId` field is shredded 
as an integer, and stored as two columns: in `typed_value.userId.typed_value` 
when the value is integer and as a variant in `typed_value.userId.value` 
otherwise. Similarly, the `eType` field is shredded as a string and stored in 
`typed_value.eType.typed_value` and `typed_value.eType.value`.
+```parquet
+optional group EVENT_DATA (VARIANT(1)) = 1 {
+    required binary metadata;           # variant metadata
+    optional binary value;              # non-shredded value   
+    optional group typed_value {
+      required group userId {          # userId field
+        optional binary value;          # non-shredded value
+        optional int32 typed_value;     # the shredded value
+      }
+      required group eType {             # eType field
+        optional binary value;          # non-shredded value
+        optional binary typed_value (STRING); # the shredded value
+      }
+    }
+}
+```
+
+**The table below illustrates how the data is stored:**
+
+| Variant | `value` | `typed_value.userId.value` | `typed_value.userId.typed_value` | `typed_value.eType.value` | `typed_value.eType.typed_value` |
+|---|---|---|---|---|---|
+| `{"userId": 100, "eType": "login"}` | `null` | `null` | `100` | `null` | `"login"` |
+| `100` | `100` | | | | | |
+| `{"userId": "Jim"}` | `null` | `"Jim"` | `null` | `null` | `null` |
+| `{"userId": 200, "amount": 99}` | `{"amount": 99}` | `null` | `200` | `null` | `null` |
+
+---
+
+## Ecosystem Adoption: A Success Story
+
+One of the most remarkable aspects of Variant's addition to Parquet is the 
rapid and widespread ecosystem adoption, demonstrating the strength of 
collaboration within the Apache Parquet community.
+
+Variant support has been implemented across multiple Parquet libraries 
including **Java**, **Arrow C++**, **Rust**, and **Go**. For the most current 
implementation status across all languages and platforms, refer to the 
[official Parquet documentation](https://github.com/apache/parquet-format).

Review Comment:
   It might make sense to link to the implementation status page here instead:
   https://parquet.apache.org/docs/file-format/implementationstatus/
   
   ```suggestion
   Variant support has been implemented across multiple Parquet libraries 
including **Java**, **Arrow C++**, **Rust**, and **Go**. For the most current 
implementation status across all languages and platforms, refer to the 
[official Parquet implementation status 
page](https://parquet.apache.org/docs/file-format/implementationstatus/).
   ```



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"

Review Comment:
   The title looked a little wordy when I rendered it locally. What do you 
think about making it slightly shorter, something like this perhaps?
   
   ```suggestion
   title: "Introducing Variant in Apache Parquet for Semi-Structured Data"
   ```
   
   <img width="838" height="268" alt="Image" src="https://github.com/user-attachments/assets/120426fa-b70a-4642-a2ee-068d4a001075" />



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,263 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"

Review Comment:
   Thank you -- that is very nice of you. I appreciate it.



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.

Review Comment:
   minor grammar nit:
   
   ```suggestion
   Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: it avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
   ```



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction

Review Comment:
   I suggest removing this initial heading, as it seems to render strangely in
the rendered document
   
   ```suggestion
   ```
   
   <img width="1156" height="681" alt="Image" src="https://github.com/user-attachments/assets/5ae68c3e-01c6-4eaa-be92-aa1137eb838d" />



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents 
with different structures to coexist in the same column. This enables seamless 
schema evolution while maintaining full queryability across all schema 
variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
 in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as 
a struct with two binary fields. The encoding is 
[designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 so engines can efficiently navigate nested structures and extract only the 
fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for 
example, field-name dictionaries for objects). This avoids repeatedly storing 
the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting 
primitive values as well as arrays and objects.
+
+#### Example
+
+A web access event can be stored in a single Variant column while preserving 
the original data types:
+
+```json
+{
+  "userId": 12345,
+  "events": [
+    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
+    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
+  ]
+}
+```
+
+Compared with storing the same payload as a JSON string, Variant retains type 
information (for example, timestamp values are stored as integers rather than 
being stored as strings), which improves correctness, enables more efficient 
querying and requires fewer bytes to store.
+
+Just as importantly, Variant supports **schema variability**: records with 
different shapes can coexist in the same column without requiring schema 
migrations. For example, the following record can be stored alongside the event 
record above:
+
+```json
+{
+  "userId": 12345,
+  "error": "auth_failure" 
+}
+```
+
+---
+
+## Shredding Encoding
+
+To enhance query performance and storage efficiency, Variant data can be 
**shredded** by extracting frequently accessed fields into separate, 
strongly-typed columns, as described in the [detailed shredding 
specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
 For each shredded field:
+
+- If the field **matches the expected schema**, its value is written to the 
strongly typed field.
+- If the field **does not match**, the original representation is written as 
Variant-encoded binary field and the corresponding strongly typed field is left 
NULL.
+
+![Shredding Variant Visualization](/blog/variant/variant_shredding.png)

Review Comment:
   A minor nitpick here is that there seems to be quite a lot of whitespace at
the bottom of this image
   
   <img width="1067" height="706" alt="Image" src="https://github.com/user-attachments/assets/4f0c8f65-6197-420f-a47f-2d138b6ce803" />



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents 
with different structures to coexist in the same column. This enables seamless 
schema evolution while maintaining full queryability across all schema 
variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
 in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as 
a struct with two binary fields. The encoding is 
[designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 so engines can efficiently navigate nested structures and extract only the 
fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for 
example, field-name dictionaries for objects). This avoids repeatedly storing 
the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting 
primitive values as well as arrays and objects.
+
+#### Example
+
+A web access event can be stored in a single Variant column while preserving 
the original data types:
+
+```json
+{
+  "userId": 12345,
+  "events": [
+    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
+    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
+  ]
+}
+```
+
+Compared with storing the same payload as a JSON string, Variant retains type 
information (for example, timestamp values are stored as integers rather than 
being stored as strings), which improves correctness, enables more efficient 
querying and requires fewer bytes to store.
+
+Just as importantly, Variant supports **schema variability**: records with 
different shapes can coexist in the same column without requiring schema 
migrations. For example, the following record can be stored alongside the event 
record above:
+
+```json
+{
+  "userId": 12345,
+  "error": "auth_failure" 
+}
+```
+
+---
+
+## Shredding Encoding
+
+To enhance query performance and storage efficiency, Variant data can be 
**shredded** by extracting frequently accessed fields into separate, 
strongly-typed columns, as described in the [detailed shredding 
specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
 For each shredded field:
+
+- If the field **matches the expected schema**, its value is written to the 
strongly typed field.
+- If the field **does not match**, the original representation is written as 
Variant-encoded binary field and the corresponding strongly typed field is left 
NULL.
+
+![Shredding Variant Visualization](/blog/variant/variant_shredding.png)
+
+The query engine decides which fields to shred based on access patterns and 
workload characteristics. Once shredded, the standard Parquet columnar 
optimizations (encoding, compression, statistics) are used for the typed 
columns.
+
+### Implementation Considerations
+
+- **Schema Inference:** Engines can infer the shredding schema from sample 
data by selecting the most frequently occurring type for each field. For 
example, if `event.id` is predominantly an integer, the engine shreds it to an 
INT64 column.
+
+- **Type Promotion:** To maximize shredding coverage, engines can promote 
types within the same type family. For example, if integer values vary in size 
(INT8, INT32, INT64), selecting INT64 as the shredded type ensures all integer 
values can be shredded rather than falling back to the unshredded 
representation.
+
+- **Metadata Control:** To control metadata overhead, engines may limit the 
number of shredded fields, since each field contributes statistics (min/max 
values, null counts) to the file footer and column stats.
+
+- **Explicit Shredding Schema:** When read patterns are known in advance, 
engines can specify an explicit shredding schema at write time, ensuring that 
frequently accessed fields are shredded for optimal query performance.
+
+### Performance Characteristics
+
+- **Selective field access:** When queries access only the shredded fields, 
only those columns are read from Parquet, skipping the rest, benefiting from 
column pruning and predicate pushdown.
+
+- **Full Variant reconstruction:** When queries require access to the complete 
Variant object, there is a performance overhead as the engine must reconstruct 
the Variant by merging data from the shredded typed fields and the base Variant 
column.
+
+### Examples of Shredded Parquet Schemas
+
+The following example shows shredding non nested Variants. In this case, the 
writer chose to shred String values as the `typed_value` column.  Rows which do 
not contain strings are stored in the `value` column, with the binary variant 
encoding.

Review Comment:
   Here is a suggestion to make this wording more consistent with the rest of 
the post: 
   
   ```suggestion
   The following example shows shredding non-nested Variant values. In this 
case, the writer chose to shred string values as the `typed_value` column. Rows 
that do not contain strings are stored in the `value` column with binary 
Variant encoding.
   ```
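
For readers following the thread, the rule this paragraph describes can be sketched roughly as below. This is only an illustration, not code from any Parquet implementation: the function name `shred_string` is hypothetical, and the `value` column is shown as the original Python object rather than as the binary Variant encoding that a real writer would produce.

```python
from typing import Any, Optional, Tuple


def shred_string(variant: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Split one Variant value into (value, typed_value).

    Assumes the writer chose STRING as the shredded type for typed_value.
    """
    if isinstance(variant, str):
        # The value matches the shredded type: store it in typed_value only.
        return None, variant
    # Any other value stays in the unshredded value column
    # (binary Variant encoding in a real file; the raw object here).
    return variant, None


# The same three values used in the blog post's table.
for row in ["Jim", 100, {"name": "Jim"}]:
    value, typed_value = shred_string(row)
    print(f"{row!r}: value={value!r}, typed_value={typed_value!r}")
```

Each row fills exactly one of the two columns, which matches the table that follows this paragraph in the post.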



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents 
with different structures to coexist in the same column. This enables seamless 
schema evolution while maintaining full queryability across all schema 
variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
 in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as 
a struct with two binary fields. The encoding is 
[designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 so engines can efficiently navigate nested structures and extract only the 
fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for 
example, field-name dictionaries for objects). This avoids repeatedly storing 
the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting 
primitive values as well as arrays and objects.
+
+#### Example
+
+A web access event can be stored in a single Variant column while preserving 
the original data types:
+
+```json
+{
+  "userId": 12345,
+  "events": [
+    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
+    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
+  ]
+}
+```
+
+Compared with storing the same payload as a JSON string, Variant retains type 
information (for example, timestamp values are stored as integers rather than 
being stored as strings), which improves correctness, enables more efficient 
querying and requires fewer bytes to store.
+
+Just as importantly, Variant supports **schema variability**: records with 
different shapes can coexist in the same column without requiring schema 
migrations. For example, the following record can be stored alongside the event 
record above:
+
+```json
+{
+  "userId": 12345,
+  "error": "auth_failure" 
+}
+```
+
+---
+
+## Shredding Encoding
+
+To enhance query performance and storage efficiency, Variant data can be 
**shredded** by extracting frequently accessed fields into separate, 
strongly-typed columns, as described in the [detailed shredding 
specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
 For each shredded field:
+
+- If the field **matches the expected schema**, its value is written to the 
strongly typed field.
+- If the field **does not match**, the original representation is written as 
Variant-encoded binary field and the corresponding strongly typed field is left 
NULL.
+
+![Shredding Variant Visualization](/blog/variant/variant_shredding.png)
+
+The query engine decides which fields to shred based on access patterns and 
workload characteristics. Once shredded, the standard Parquet columnar 
optimizations (encoding, compression, statistics) are used for the typed 
columns.
+
+### Implementation Considerations
+
+- **Schema Inference:** Engines can infer the shredding schema from sample 
data by selecting the most frequently occurring type for each field. For 
example, if `event.id` is predominantly an integer, the engine shreds it to an 
INT64 column.
+
+- **Type Promotion:** To maximize shredding coverage, engines can promote 
types within the same type family. For example, if integer values vary in size 
(INT8, INT32, INT64), selecting INT64 as the shredded type ensures all integer 
values can be shredded rather than falling back to the unshredded 
representation.
+
+- **Metadata Control:** To control metadata overhead, engines may limit the 
number of shredded fields, since each field contributes statistics (min/max 
values, null counts) to the file footer and column stats.
+
+- **Explicit Shredding Schema:** When read patterns are known in advance, 
engines can specify an explicit shredding schema at write time, ensuring that 
frequently accessed fields are shredded for optimal query performance.
+
+### Performance Characteristics
+
+- **Selective field access:** When queries access only the shredded fields, 
only those columns are read from Parquet, skipping the rest, benefiting from 
column pruning and predicate pushdown.
+
+- **Full Variant reconstruction:** When queries require access to the complete 
Variant object, there is a performance overhead as the engine must reconstruct 
the Variant by merging data from the shredded typed fields and the base Variant 
column.
+
+### Examples of Shredded Parquet Schemas
+
+The following example shows shredding non nested Variants. In this case, the 
writer chose to shred String values as the `typed_value` column.  Rows which do 
not contain strings are stored in the `value` column, with the binary variant 
encoding.
+
+```parquet
+optional group SIMPLE_DATA (VARIANT(1)) = 1 { 
+    required binary metadata;           # variant metadata
+    optional binary value;              # non-shredded value   
+    optional binary typed_value (STRING); # the shredded value 
+}
+```
+
+The series of variant values “Jim”, 100,  {“name”: “Jim”} are encoded as:
+
+| Variant Value | `value` | `typed_value` |
+|---------------|---------|---------------|
+| `"Jim"` | `null` | `"Jim"` |
+| `100` | `100` | `null` |
+| `{"name": "Jim"}` | `{"name": "Jim"}` | `null` |
+
+---
+
+Shredding nested variants is similar, with the shredding applied recursively, 
as shown in the following example. In this case, the `userId` field is shredded 
as an integer, and stored as two columns: in `typed_value.userId.typed_value` 
when the value is integer and as a variant in `typed_value.userId.value` 
otherwise. Similarly, the `eType` field is shredded as a string and stored in 
`typed_value.eType.typed_value` and `typed_value.eType.value`.

Review Comment:
   ```suggestion
   Shredding nested Variant values is similar, with shredding applied 
recursively, as shown in the following example. In this case, the `userId` 
field is shredded as an integer and stored in two columns: 
`typed_value.userId.typed_value` when the value is an integer, and 
`typed_value.userId.value` otherwise. Similarly, the `eType` field is shredded 
as a string and stored in `typed_value.eType.typed_value` and 
`typed_value.eType.value`.
   
   ```
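
For context, this is how I read the recursive rule, written as a small Python sketch. It is a sketch under assumptions rather than what any writer actually does: `SCHEMA`, `shred_field`, and `shred_object` are hypothetical names, Python `int`/`str` stand in for the `int32` and `STRING` columns, and plain Python values stand in for Variant-encoded binary. A field whose value is Python `None` is indistinguishable from a missing field here, which a real encoding would keep apart.

```python
from typing import Any, Dict, Optional

# Hypothetical shredding schema mirroring the example:
# field name -> Python type standing in for the Parquet column type.
SCHEMA = {"userId": int, "eType": str}


def shred_field(field_value: Any, expected: type) -> Dict[str, Optional[Any]]:
    """Shred one field into its (`value`, `typed_value`) pair."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(field_value, expected) and not isinstance(field_value, bool):
        return {"value": None, "typed_value": field_value}
    return {"value": field_value, "typed_value": None}


def shred_object(variant: Any) -> Dict[str, Any]:
    """Apply the shredding schema to a top-level value."""
    if not isinstance(variant, dict):
        # Not an object: the whole value stays in the top-level `value`
        # and the `typed_value` group is null.
        return {"value": variant, "typed_value": None}
    typed = {}
    for name, expected in SCHEMA.items():
        if name in variant:
            typed[name] = shred_field(variant[name], expected)
        else:
            # Missing field: both columns are null.
            typed[name] = {"value": None, "typed_value": None}
    # Fields outside the shredding schema remain in the top-level `value`.
    leftover = {k: v for k, v in variant.items() if k not in SCHEMA}
    return {"value": leftover or None, "typed_value": typed}


print(shred_object({"userId": 100, "eType": "login"}))
print(shred_object(100))
```

Under this reading, a non-object row such as `100` leaves the whole `typed_value` group null, so every nested `typed_value.*` column reads as null for that row.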



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.

Review Comment:
   ```suggestion
   - **Fast Query Performance:** Direct offset-based field access provides 
performance improvements over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
   ```



##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,256 @@
+---
+title: "The Evolution of Semi-Structured Data: Introducing Variant in Apache 
Parquet"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew 
Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the 
**Variant type**—a feature that brings native support for semi-structured data 
to Parquet, significantly improving efficiency compared to less efficient 
formats such as JSON. This marks a significant addition to Parquet, 
demonstrating how the format continues to evolve to meet modern data 
engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each 
value has a fixed and known type, handling heterogeneous, nested data often 
required a compromise: either store it as a costly-to-parse JSON string or 
flatten it into a rigid schema. The introduction of the Variant logical type 
provides a native, high-performance solution for semi-structured data that is 
already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and 
process semi-structured data—JSON-like documents with arbitrary and evolving 
schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full 
parsing to access any field, making queries slow and resource-intensive, 
Variant solves this by storing data in a **structured binary format** that 
enables direct field access through offset-based navigation. Query engines can 
jump directly to nested fields without deserializing the entire document, 
dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the 
common case where multiple values share a similar structure: It avoids 
redundantly storing repeated field names and standardizes the best practice of 
**"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their 
native formats—data types (integers, strings, booleans, timestamps, etc.) are 
preserved, unlike JSON which has a limited type system with no native support 
for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to 
minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides 
performance improvement over JSON string parsing. Optional shredding of 
frequently accessed fields into typed columns further enhances query pruning 
and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents 
with different structures to coexist in the same column. This enables seamless 
schema evolution while maintaining full queryability across all schema 
variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
 in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as 
a struct with two binary fields. The encoding is 
[designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 so engines can efficiently navigate nested structures and extract only the 
fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for 
example, field-name dictionaries for objects). This avoids repeatedly storing 
the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting 
primitive values as well as arrays and objects.
+
+#### Example
+
+A web access event can be stored in a single Variant column while preserving 
the original data types:
+
+```json
+{
+  "userId": 12345,
+  "events": [
+    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
+    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
+  ]
+}
+```
+
+Compared with storing the same payload as a JSON string, Variant retains type 
information (for example, timestamp values are stored as integers rather than 
being stored as strings), which improves correctness, enables more efficient 
querying and requires fewer bytes to store.
+
+Just as importantly, Variant supports **schema variability**: records with 
different shapes can coexist in the same column without requiring schema 
migrations. For example, the following record can be stored alongside the event 
record above:
+
+```json
+{
+  "userId": 12345,
+  "error": "auth_failure" 
+}
+```
+
+---
+
+## Shredding Encoding
+
+To enhance query performance and storage efficiency, Variant data can be 
**shredded** by extracting frequently accessed fields into separate, 
strongly-typed columns, as described in the [detailed shredding 
specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
 For each shredded field:
+
+- If the field **matches the expected schema**, its value is written to the 
strongly typed field.
+- If the field **does not match**, the original representation is written as 
Variant-encoded binary field and the corresponding strongly typed field is left 
NULL.
+
+![Shredding Variant Visualization](/blog/variant/variant_shredding.png)
+
+The query engine decides which fields to shred based on access patterns and 
workload characteristics. Once shredded, the standard Parquet columnar 
optimizations (encoding, compression, statistics) are used for the typed 
columns.
+
+### Implementation Considerations
+
+- **Schema Inference:** Engines can infer the shredding schema from sample 
data by selecting the most frequently occurring type for each field. For 
example, if `event.id` is predominantly an integer, the engine shreds it to an 
INT64 column.
+
+- **Type Promotion:** To maximize shredding coverage, engines can promote 
types within the same type family. For example, if integer values vary in size 
(INT8, INT32, INT64), selecting INT64 as the shredded type ensures all integer 
values can be shredded rather than falling back to the unshredded 
representation.
+
+- **Metadata Control:** To control metadata overhead, engines may limit the 
number of shredded fields, since each field contributes statistics (min/max 
values, null counts) to the file footer and column stats.
+
+- **Explicit Shredding Schema:** When read patterns are known in advance, 
engines can specify an explicit shredding schema at write time, ensuring that 
frequently accessed fields are shredded for optimal query performance.
+
+### Performance Characteristics
+
+- **Selective field access:** When queries access only the shredded fields, 
only those columns are read from Parquet, skipping the rest, benefiting from 
column pruning and predicate pushdown.
+
+- **Full Variant reconstruction:** When queries require access to the complete 
Variant object, there is a performance overhead as the engine must reconstruct 
the Variant by merging data from the shredded typed fields and the base Variant 
column.
+
+### Examples of Shredded Parquet Schemas
+
+The following example shows shredding non nested Variants. In this case, the 
writer chose to shred String values as the `typed_value` column.  Rows which do 
not contain strings are stored in the `value` column, with the binary variant 
encoding.
+
+```parquet
+optional group SIMPLE_DATA (VARIANT(1)) = 1 { 
+    required binary metadata;           # variant metadata
+    optional binary value;              # non-shredded value   
+    optional binary typed_value (STRING); # the shredded value 
+}
+```
+
+The series of variant values “Jim”, 100,  {“name”: “Jim”} are encoded as:
+
+| Variant Value | `value` | `typed_value` |
+|---------------|---------|---------------|
+| `"Jim"` | `null` | `"Jim"` |
+| `100` | `100` | `null` |
+| `{"name": "Jim"}` | `{"name": "Jim"}` | `null` |
+
+---
+
+Shredding nested variants is similar, with the shredding applied recursively, 
as shown in the following example. In this case, the `userId` field is shredded 
as an integer, and stored as two columns: in `typed_value.userId.typed_value` 
when the value is integer and as a variant in `typed_value.userId.value` 
otherwise. Similarly, the `eType` field is shredded as a string and stored in 
`typed_value.eType.typed_value` and `typed_value.eType.value`.
+```parquet
+optional group EVENT_DATA (VARIANT(1)) = 1 {
+    required binary metadata;           # variant metadata
+    optional binary value;              # non-shredded value   
+    optional group typed_value {
+      required group userId {          # userId field
+        optional binary value;          # non-shredded value
+        optional int32 typed_value;     # the shredded value
+      }
+      required group eType {             # eType field
+        optional binary value;          # non-shredded value
+        optional binary typed_value (STRING); # the shredded value
+      }
+    }
+}
+```
+
+**The table below illustrates how the data is stored:**
+
+| Variant                             | `value`          | 
`typed_value.userId.value` | `typed_value.userId.typed_value` | 
`typed_value.eType.value` | `typed_value.eType.typed_value` |
+|-------------------------------------|------------------|----------------------------|----------------------------------|---------------------------|---------------------|
+| `{"userId": 100, "eType": "login"}` | `null`           | `null`              
       | `100`                            | `null`                    | 
`"login"`           |
+| `100`                               | `100`            |                     
       |                                  |                           |         
|           |

Review Comment:
   Should the `typed_value.userId.value` and the other `typed_value.*` columns 
also be `null` for this row, to match the other rows?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
