alamb commented on code in PR #171:
URL: https://github.com/apache/parquet-site/pull/171#discussion_r2841140739
##########
content/en/blog/features/variant.md:
##########
@@ -0,0 +1,257 @@
+---
+title: "Introducing Variant in Apache Parquet for Semi-Structured Data"
+date: 2026-02-14
+description: "Native Variant Type in Apache Parquet"
+author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew Lamb](https://github.com/alamb)"
+categories: ["features"]
+---
+
+## Introduction
+
+The Apache Parquet community is excited to announce the addition of the **Variant type**—a feature that brings native support for semi-structured data to Parquet, significantly improving efficiency compared to less efficient formats such as JSON. This marks a significant addition to Parquet, demonstrating how the format continues to evolve to meet modern data engineering needs.
+
+While Apache Parquet has long been the standard for structured data where each value has a fixed and known type, handling heterogeneous, nested data often required a compromise: either store it as a costly-to-parse JSON string or flatten it into a rigid schema. The introduction of the Variant logical type provides a native, high-performance solution for semi-structured data that is already seeing rapid uptake across the ecosystem.
+
+---
+
+## What is Variant?
+
+**Variant** is a self-describing data type designed to efficiently store and process semi-structured data—JSON-like documents with arbitrary and evolving schemas.
+
+---
+
+## Why Variant?
+
+Unlike traditional approaches that store JSON as text strings and require full parsing to access any field, making queries slow and resource-intensive, Variant solves this by storing data in a **structured binary format** that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, dramatically improving performance.
+
+Unlike similar binary encodings such as BSON, Variant is optimized for the common case where multiple values share a similar structure: it avoids redundantly storing repeated field names and standardizes the best practice of **"shredded storage"** for pre-extracting structured subsets.
+
+### Key Benefits
+
+- **Type-Preserving Storage:** Original data types are maintained in their native formats—data types (integers, strings, booleans, timestamps, etc.) are preserved, unlike JSON which has a limited type system with no native support for types like timestamps or integers.
+
+- **Efficient Encoding:** The binary format uses field name deduplication to minimize storage overhead compared to JSON strings or BSON encoding.
+
+- **Fast Query Performance:** Direct offset-based field access provides performance improvements over JSON string parsing. Optional shredding of frequently accessed fields into typed columns further enhances query pruning and predicate pushdown.
+
+- **Schema Flexibility:** No predefined schema is required, allowing documents with different structures to coexist in the same column. This enables seamless schema evolution while maintaining full queryability across all schema variations, while still taking advantage of common structures when present.
+
+---
+
+## Overview of Variant Type in Parquet
+
+Parquet introduced the [Variant logical type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant) in [August 2025](https://github.com/apache/parquet-format/pull/509).
+
+### Variant Encoding
+
+In Parquet, Variant is represented as a logical type and stored physically as a struct with two binary fields. The encoding is [designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) so engines can efficiently navigate nested structures and extract only the fields they need, rather than parsing the entire binary blob.
+
+```parquet
+optional group event_data (VARIANT(1)) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+- **`metadata`:** Encodes type information and shared dictionaries (for example, field-name dictionaries for objects). This avoids repeatedly storing the same strings and enables efficient navigation.
+- **`value`:** Encodes the actual data in a compact binary form, supporting primitive values as well as arrays and objects.
+
+#### Example

Review Comment:
   @aihuaxu reworked the intro to incorporate this feedback
