alamb commented on code in PR #739:
URL: https://github.com/apache/arrow-site/pull/739#discussion_r2566369720
##########
_posts/2025-11-25-variant.md:
##########
@@ -0,0 +1,114 @@
+---
+layout: post
+title: "Variant: tbd"
+author: friendlymatthew
+date: "2025-11-25 00:00:00"
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Variant is a data type designed to solve JSON's performance problems in OLAP
systems, and its initial implementation has been released in Arrow 57.0.0.
+
+This article explains the limitations of JSON, how Variant solves these
problems, and why you should be excited about it by analyzing performance
characteristics on a real world dataset.
+
+## What's the problem with JSON?
Review Comment:
It might help here to also point out the benefits of JSON. Namely
1. You don't have to define your schema up front before loading it into the
OLAP system (the biggest one, I think)
2. it can tolerate different structures row by row (the messy part)

Supporting JSON/semi-structured types makes OLAP systems applicable for many
tasks traditionally done *before* loading into the data system (like ETL),
enabling users to take advantage of the many great OLAP engines.
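To make the first benefit concrete, here is a tiny Python sketch (hypothetical rows, stdlib `json` only): structurally different rows ingest without any schema declared up front.

```python
import json

# Hypothetical event rows: no schema declared up front,
# and the structure varies row by row.
raw_rows = [
    '{"user_id": 123, "event": "click", "target": "#buy"}',
    '{"user_id": 456, "event": "purchase", "items": [{"sku": "A1", "qty": 2}]}',
    '{"user_id": 789, "error": {"code": 500, "msg": "timeout"}}',
]

# Every row parses on its own; the effective "schema" is just
# whatever keys happen to appear.
rows = [json.loads(r) for r in raw_rows]
all_keys = sorted({k for row in rows for k in row})
print(all_keys)
```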
##########
_posts/2025-11-25-variant.md:
##########
+Many real world datasets are messy and lack a rigid schema. As a result, it's
common to store such data as a JSON column.
+
+The access pattern of this model looks like:
+
+- **read**: for every row you scan, decode the entire JSON into memory and
evaluate it
+- **write**: for every record you write, encode it as its own JSON value
+
+There are many problems with this approach.
+
+_Read performance degrades quickly._ Evaluating any predicate requires
scanning every row and deserializing the entire JSON payload into memory, no
matter its size. Even multi-megabyte documents must be completely decoded just
to inspect a single field.
Review Comment:
it might also be worth emphasizing again that single-field reads require
decoding the string, then *parsing* that string into an in-memory structure,
and then walking that in-memory structure, for each row
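As a sketch of that point (plain Python, hypothetical rows; stdlib `json` stands in for any JSON decoder), note how every row pays the full decode, parse, and walk cost just to read one field:

```python
import json

rows = [
    '{"user_id": 1, "payload": {"a": 1, "b": 2}, "start_timestamp": "2025-01-02"}',
    '{"user_id": 2, "payload": {"a": 3}, "start_timestamp": "2025-03-04"}',
]

timestamps = []
for raw in rows:
    # 1. decode + parse: the WHOLE string becomes an in-memory tree,
    #    including `payload`, which this query never touches
    doc = json.loads(raw)
    # 2. walk: traverse that tree to reach the single wanted field
    timestamps.append(doc["start_timestamp"])

print(timestamps)
```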
##########
_posts/2025-11-25-variant.md:
##########
+Variant is a data type designed to solve JSON's performance problems in OLAP
systems, and its initial implementation has been released in Arrow 57.0.0.
Review Comment:
I suggest adding links (e.g. to the crates.io arrow page)
##########
_posts/2025-11-25-variant.md:
##########
+_Write efficiency is also poor._ Because each row stores an independently
encoded JSON object, common object fields are redundantly serialized over and
over. This increases storage size and adds unnecessary encoding overhead.
+
+These problems aren't new, and solutions exist that improve the access
patterns of a JSON column. One common solution is to extract commonly occurring
object keys into dedicated columns, also known as object shredding. Yet this
requires substantial engineering effort, and implementation details vary by
query engine.
+
+## How Variant solves these problems
+
+Variant is a data type with an efficient binary encoding. It's designed to
store JSON data in a way that is more performant for OLAP query engines.
+
+### Variant has richer data types
+
+JSON is limited to just six data types: strings, numbers, booleans, nulls,
objects, and arrays. This simplicity comes at a cost, as specialized data types
like timestamps and UUIDs must be encoded as strings, then inferred and parsed
back at query time.
+
+Variant removes this overhead by supporting a much broader range of native
types organized into three categories: primitives, objects, and arrays. Objects
and arrays work exactly like their JSON counterparts, supporting arbitrarily
nested structures. Variant extends the primitive category to include 20
specialized types: dates, timestamps, UUIDs, binary data, and integers and
floats of various widths all have their own native representations. Values are
encoded in type-specific binary formats optimized for their native type rather
than a stringified representation. This reduces both storage overhead and
query-time parsing costs.
+
+_[Figure: diagram of { id: "some uuid", timestamp: ...} encoded in both JSON
and Variant logically and physically]_
+
+### Variant has efficient serialization through a 2-column design
+
+JSON columns use naive serialization, which leads to unnecessary encoding
overhead. For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}`
thousands of times, the field names `user_id` and `timestamp` are written
out in full for each row, even though they're identical across rows.
Review Comment:
this is a good example, I think we could make a nice diagram that
illustrates the point nicely -- aka a bunch of JSON documents with repeated
fields showing that when encoded as JSON the strings get repeated over and over.
Then we can add a "here is how you would encode it as variant" showing both
the single copy of field names as well as the split into two different fields 🤔
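In the meantime, the savings can be sketched numerically (toy Python model of the metadata/value split; the real Variant layout is a binary format, and the names here are hypothetical):

```python
import json

# 1,000 rows repeating the same two field names.
rows = [{"user_id": i, "timestamp": "2025-04-21"} for i in range(1000)]

# Plain JSON: both field names are serialized in full for every row.
json_bytes = sum(len(json.dumps(r).encode()) for r in rows)

# Toy metadata/value split: each unique field name is stored once,
# and values refer to names by index into that dictionary.
metadata = sorted({k for r in rows for k in r})
name_to_id = {k: i for i, k in enumerate(metadata)}
values = [{name_to_id[k]: v for k, v in r.items()} for r in rows]

print(metadata)  # field names appear exactly once, not 1,000 times
```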
##########
_posts/2025-11-25-variant.md:
##########
+Variant avoids such overhead by splitting data across 2 columns: `metadata`
and `value`.
+
+The `metadata` column stores a dictionary of unique field names. The `value`
column stores the actual data. When encoding an object, field names aren't
written inline. Instead, each field references its name by offset position in
the metadata dictionary.
+
+_[Figure: diagram of a primitive variant and variant object being written to a
metadata and value column]_
+
+At the file level, this design enables a powerful optimization. You can build
a single metadata dictionary per column chunk, containing the union of all
unique field names across every row in that chunk. Rows with different schemas
all reference the same shared dictionary. This way, each unique field name
appears exactly once in the dictionary, even if it's used in thousands of rows.
+
+_[Figure: diagram of multiple variants pointing to the same metadata
dictionary]_
+
+### Variant guarantees faster search performance
+
+When objects are encoded, field entries must be written in lexicographic order
by field name. This ordering constraint enables efficient field lookups:
finding a value by field name takes O(log(n)) time by binary search, where n is
the number of fields in the object.
+
+Without this guarantee, you'd need to sequentially scan through every field.
For deeply nested objects with dozens of fields, this difference compounds
quickly.
+
+_maybe i should add some benchmarks here?_
+
+The metadata dictionary can similarly be optimized for faster search
performance by making the list of field names unique and sorted.
+
+### Variant can leverage file format capabilities
+
+Even with fast field lookups, Variant still requires full deserialization to
access any field, as the whole encoded value must be decoded just to read a
single field. This wastes CPU and memory on data that you don't need.
+
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
+
+This enables columnar file formats like Parquet to leverage their full
optimization toolset. Suppose you had to evaluate a query like:
+
+```sql
+SELECT user_id, start_timestamp FROM events
+WHERE start_timestamp > '2025-01-01' AND user_id = 12345;
+```
+
+Zone maps can skip entire row groups where `start_timestamp` falls outside the
query range. Dictionary encoding can compress repeated `user_id` values
efficiently. Bloom filters can rule out row groups without your target user.
More importantly, you only deserialize the shredded columns you need, not the
Variant columns.
+
+Shredding makes the trade-off explicit: extract frequently queried fields into
optimized columns at write time, and keep everything else flexible. You get
columnar performance for common access patterns and schema flexibility for
everything else.
Review Comment:
I think it would help to show how this would work in practice.
```sql
SELECT count(*) from events
WHERE event["start_timestamp"] > '2025-01-01' AND event["user_id"] = 12345;
```
To
```sql
SELECT count(*) from events
WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id =
12345;
```
you would then have to explain that predicate pushdown works for the typed
column
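Agreed -- and once the predicate targets the typed column, the pushdown itself is easy to sketch (toy Python, hypothetical row-group statistics): min/max zone maps rule out whole row groups before any variant bytes are touched.

```python
# Hypothetical per-row-group min/max statistics for the shredded
# start_timestamp column (toy model; real stats live in Parquet metadata).
row_groups = [
    {"min": "2024-01-01", "max": "2024-06-30", "rows": 10_000},
    {"min": "2024-07-01", "max": "2024-12-31", "rows": 10_000},
    {"min": "2025-01-02", "max": "2025-06-30", "rows": 10_000},
]

lower_bound = "2025-01-01"  # WHERE start_timestamp > '2025-01-01'

# Zone-map pruning: a group whose max can't exceed the bound can't
# match, so its rows are never decoded at all.
surviving = [g for g in row_groups if g["max"] > lower_bound]
print(f"{len(surviving)} of {len(row_groups)} row groups scanned")
```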
##########
_posts/2025-11-25-variant.md:
##########
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
Review Comment:
```suggestion
Variant solves this problem by standardizing a shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
```
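As a rough illustration of what the spec standardizes (toy Python, hypothetical field names; the actual shredded layout is defined at the file-format level): values that fit the declared type land in a typed column, and everything else falls back to the variant column.

```python
# Toy rows: user_id is usually a 64-bit int, but the data is messy.
rows = [
    {"user_id": 12345, "event": "click"},
    {"user_id": 67890, "event": "view"},
    {"user_id": "anonymous", "event": "view"},  # doesn't fit the typed column
]

typed_user_id = []  # shredded integer column (None = not shredded for this row)
fallback = []       # stand-in for the residual variant column

for row in rows:
    uid = row["user_id"]
    if isinstance(uid, int):
        typed_user_id.append(uid)
        fallback.append(None)
    else:
        typed_user_id.append(None)
        fallback.append(uid)

# A predicate like user_id = 12345 now scans only the typed column.
matches = sum(1 for uid in typed_user_id if uid == 12345)
print(typed_user_id, matches)
```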
##########
_posts/2025-11-25-variant.md:
##########
+This article explains the limitations of JSON, how Variant solves these
problems, and why you should be excited about it by analyzing performance
characteristics on a real world dataset.
Review Comment:
```suggestion
This article explains the limitations of JSON, introduces the Variant type
that solves these problems, and explains why you should be excited about it by
analyzing performance characteristics on a real world dataset.
```
##########
_posts/2025-11-25-variant.md:
##########
+### Variant guarantees faster search performance
+
+When objects are encoded, field entries must be written in lexicographic order
by field name. This ordering constraint enables efficient field lookups:
finding a value by field name takes O(log(n)) time by binary search, where n is
the number of fields in the object.
+
+Without this guarantee, you'd need to sequentially scan through every field.
For deeply nested objects with dozens of fields, this difference compounds
quickly.
+
+_maybe i should add some benchmarks here?_
Review Comment:
I think this is a nice detail but probably not worth belaboring -- this
section is an example of "efficient binary encoding" I think
##########
_posts/2025-11-25-variant.md:
##########
+## What's the problem with JSON?
+
+Many real-world datasets are messy and lack a rigid schema. As a result, it's
common to store such data as a JSON column.
+
+The access pattern of this model looks like:
+
+- **read**: for every row you scan, decode the entire JSON into memory and
evaluate it
+- **write**: for every record you write, encode it as its own JSON value
+
+There are many problems with this approach.
+
+_Read performance degrades quickly._ Evaluating any predicate requires
scanning every row and deserializing the entire JSON payload into memory, no
matter its size. Even multi-megabyte documents must be completely decoded just
to inspect a single field.
+
+_Write efficiency is also poor._ Because each row stores an independently
encoded JSON object, common object fields are redundantly serialized over and
over. This increases storage size and adds unnecessary encoding overhead.
+
+These problems aren't new, and solutions exist that improve the access patterns
of a JSON column. One common solution is to extract commonly occurring object
keys into dedicated columns, also known as object shredding. Yet this
requires substantial engineering effort, and implementation details vary by
query engine.
+
+## How Variant solves these problems
+
+Variant is a data type with an efficient binary encoding. It's designed to
store JSON data in a way that is more performant for OLAP query engines.
+
+### Variant has richer data types
+
+JSON is limited to just six data types: strings, numbers, booleans, nulls,
objects, and arrays. This simplicity comes at a cost, as specialized data types
like timestamps and UUIDs must be encoded as strings, then inferred and parsed
back at query time.
+
+Variant removes this overhead by supporting a much broader range of native
types organized into three categories: primitives, objects, and arrays. Objects
and arrays work exactly like their JSON counterparts, supporting arbitrary
nested structures. Variant extends the primitive category to include 20
specialized types. Dates, timestamps, UUIDs, binary data, and integers and
floats of various widths all have their own native representations. Values are
encoded in type-specific binary formats, optimized for their native type rather
than a stringified representation. This reduces both storage overhead and
query-time parsing costs.
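As a rough illustration (the microsecond-integer encoding here is invented for
the example, not the actual Variant byte layout), compare a timestamp carried
as a JSON string with the same instant packed as a fixed-width integer:

```python
import struct
from datetime import datetime, timezone

# Hypothetical comparison: the same instant as JSON text vs. a packed
# 8-byte little-endian integer of microseconds since the epoch. The
# real Variant encoding differs; this only shows the string-vs-binary
# gap in size and parsing work.
ts = datetime(2025, 4, 21, tzinfo=timezone.utc)

as_json = '"2025-04-21T00:00:00+00:00"'  # re-parsed on every read
as_micros = struct.pack("<q", int(ts.timestamp() * 1_000_000))  # fixed 8 bytes
```

The binary form is both smaller and directly usable without parsing at query
time.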
+
+The metadata dictionary can similarly be optimized for faster search
performance by making the list of field names unique and sorted.
+
+### Variant can leverage file format capabilities
Review Comment:
I know this is a general capability applicable to all (columnar) file
formats, but since this blog is about the Variant that was added to Parquet, I
recommend you focus this part on Parquet ("Variant can leverage Parquet's
columnar architecture" or something)
##########
_posts/2025-11-25-variant.md:
##########
+
+### Variant can leverage file format capabilities
+
+Even with fast field lookups, Variant still requires full deserialization to
access any field: the entire payload must be decoded just to read a single
value. This wastes CPU and memory on data you don't need.
+
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, these can be shredded into dedicated timestamp
and integer columns alongside the Variant columns.
+
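Conceptually, shredding looks something like the sketch below (illustrative
only — the field and column names come from the example above, and the real
column layout is defined by the shredding specification):

```python
def shred(rows, field):
    """Pull one frequently queried field out into its own typed column.

    Returns the typed column (None where the field is absent) and the
    remainder of each row, which stays in the flexible variant payload.
    """
    typed_column = [row.get(field) for row in rows]
    remainder = [
        {k: v for k, v in row.items() if k != field} for row in rows
    ]
    return typed_column, remainder


rows = [
    {"user_id": 123, "start_timestamp": "2025-01-02", "extra": {"a": 1}},
    {"user_id": 456, "start_timestamp": "2024-12-31"},
]
user_ids, rest = shred(rows, "user_id")
```

The typed column can now be encoded, compressed, and filtered like any other
Parquet column, while the remainder keeps its schema flexibility.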
+This enables columnar file formats like Parquet to leverage their full
optimization toolset. Suppose you had to evaluate a query like:
+
+```sql
+SELECT user_id, start_timestamp FROM events
+WHERE start_timestamp > '2025-01-01' AND user_id = 12345;
+```
+
+Zone maps can skip entire row groups where `start_timestamp` falls outside the
query range. Dictionary encoding can compress repeated `user_id` values
efficiently. Bloom filters can rule out row groups without your target user.
More importantly, you only deserialize the shredded columns you need, not the
Variant columns.
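The zone-map pruning step can be sketched as follows (the row-group statistics
here are invented for the example):

```python
def can_skip(stats, cutoff):
    """Decide whether a whole row group can be skipped.

    stats: (min, max) values of start_timestamp for one row group.
    For the predicate start_timestamp > cutoff, a group whose maximum
    is at or below the cutoff cannot contain any matching row.
    ISO-8601 date strings compare correctly lexicographically.
    """
    _, max_ts = stats
    return max_ts <= cutoff


# Hypothetical per-row-group (min, max) statistics for start_timestamp.
row_groups = [("2024-01-01", "2024-06-30"), ("2024-12-01", "2025-03-31")]
skippable = [can_skip(stats, "2025-01-01") for stats in row_groups]
```

Only the second row group needs to be read at all; the first is pruned from
its statistics alone, without touching the data pages.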
+
+Shredding makes the trade-off explicit: extract frequently queried fields into
optimized columns at write time, and keep everything else flexible. You get
columnar performance for common access patterns and schema flexibility for
everything else.
+
+## Why you should be excited about Variant
+
+In this section, we'll explore the performance characteristics of JSON and
Variant data. We'll use ClickHouse's JSON Bench
[dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events),
a billion-record collection of Bluesky events designed to represent real-world
production data.
+
+Our benchmarks focus on three key metrics: write performance when serializing
data to Parquet files, storage efficiency comparing both compressed and
uncompressed file sizes, and query performance across common access patterns.
+
+For query execution, we use DataFusion, a popular query engine. To work with
Variant in DataFusion, we use
[datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant),
a library that implements native Variant type support.
+
Review Comment:
this looks great -- I can't wait for this section!
##########
_posts/2025-11-25-variant.md:
##########
Review Comment:
It would also be good to add a link to the actual shredding spec for people
to refer to if they want more details
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]