alamb commented on code in PR #739:
URL: https://github.com/apache/arrow-site/pull/739#discussion_r2566369720
##########
_posts/2025-11-25-variant.md:
##########
@@ -0,0 +1,114 @@
+---
+layout: post
+title: "Variant: tbd"
+author: friendlymatthew
+date: "2025-11-25 00:00:00"
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Variant is a data type designed to solve JSON's performance problems in OLAP
systems, and its initial implementation has been released in Arrow 57.0.0.
+
+This article explains the limitations of JSON, how Variant solves these
problems, and why you should be excited about it by analyzing performance
characteristics on a real world dataset.
+
+## What's the problem with JSON?
Review Comment:
It might help here to also point out the benefits of JSON. Namely
1. You don't have to define your schema up front before loading it into the
OLAP system (the biggest one, I think)
2. it can tolerate different structures row by row (the messy part)

Supporting JSON/semi-structured types makes OLAP systems applicable for many
tasks traditionally done *before* loading into the data system (like ETL),
enabling users to take advantage of the many great OLAP engines.
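To make the first benefit concrete, here is a tiny Python sketch (hypothetical rows, stdlib `json` only): structurally different rows ingest without any schema declared up front.

```python
import json

# Hypothetical event rows: no schema declared up front,
# and the structure varies row by row.
raw_rows = [
    '{"user_id": 123, "event": "click", "target": "#buy"}',
    '{"user_id": 456, "event": "purchase", "items": [{"sku": "A1", "qty": 2}]}',
    '{"user_id": 789, "error": {"code": 500, "msg": "timeout"}}',
]

# Every row parses on its own; the effective "schema" is just
# whatever keys happen to appear.
rows = [json.loads(r) for r in raw_rows]
all_keys = sorted({k for row in rows for k in row})
print(all_keys)
```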
##########
_posts/2025-11-25-variant.md:
##########
+Many real world datasets are messy and lack a rigid schema. As a result, it's
common to store such data as a JSON column.
+
+The access pattern of this model looks like:
+
+- **read**: for every row you scan, decode the entire JSON into memory and
evaluate it
+- **write**: for every record you write, encode it as its own JSON value
+
+There are many problems with this approach.
+
+_Read performance degrades quickly._ Evaluating any predicate requires
scanning every row and deserializing the entire JSON payload into memory, no
matter its size. Even multi-megabyte documents must be completely decoded just
to inspect a single field.
Review Comment:
it might also be worth emphasizing again that single-field reads require
decoding the string, then *parsing* that string into an in-memory structure,
and then walking that in-memory structure, for each row
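As a sketch of that point (plain Python, hypothetical rows; stdlib `json` stands in for any JSON decoder), note how every row pays the full decode, parse, and walk cost just to read one field:

```python
import json

rows = [
    '{"user_id": 1, "payload": {"a": 1, "b": 2}, "start_timestamp": "2025-01-02"}',
    '{"user_id": 2, "payload": {"a": 3}, "start_timestamp": "2025-03-04"}',
]

timestamps = []
for raw in rows:
    # 1. decode + parse: the WHOLE string becomes an in-memory tree,
    #    including `payload`, which this query never touches
    doc = json.loads(raw)
    # 2. walk: traverse that tree to reach the single wanted field
    timestamps.append(doc["start_timestamp"])

print(timestamps)
```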
##########
_posts/2025-11-25-variant.md:
##########
+Variant is a data type designed to solve JSON's performance problems in OLAP
systems, and its initial implementation has been released in Arrow 57.0.0.
Review Comment:
I suggest adding links (e.g. to the crates.io arrow page)
##########
_posts/2025-11-25-variant.md:
##########
+_Write efficiency is also poor._ Because each row stores an independently
encoded JSON object, common object fields are redundantly serialized over and
over. This increases storage size and adds unnecessary encoding overhead.
+
+These problems aren't new, and solutions exist that improve the access
patterns of a JSON column. One common solution is to extract commonly occurring
object keys into dedicated columns, also known as object shredding. Yet this
requires substantial engineering effort, and implementation details vary by
query engine.
+
+## How Variant solves these problems
+
+Variant is a data type with an efficient binary encoding. It's designed to
store JSON data in a way that is more performant for OLAP query engines.
+
+### Variant has richer data types
+
+JSON is limited to just six data types: strings, numbers, booleans, nulls,
objects, and arrays. This simplicity comes at a cost, as specialized data types
like timestamps and UUIDs must be encoded as strings, then inferred and parsed
back at query time.
+
+Variant removes this overhead by supporting a much broader range of native
types organized into three categories: primitives, objects, and arrays. Objects
and arrays work exactly like their JSON counterparts, supporting arbitrarily
nested structures. Variant extends the primitive category to include 20
specialized types: dates, timestamps, UUIDs, binary data, and integers and
floats of various widths all have their own native representations. Values are
encoded in type-specific binary formats optimized for their native type rather
than a stringified representation. This reduces both storage overhead and
query-time parsing costs.
+
+_[Figure: diagram of { id: "some uuid", timestamp: ...} encoded in both JSON
and Variant logically and physically]_
+
+### Variant has efficient serialization through a 2-column design
+
+JSON columns use naive serialization, which leads to unnecessary encoding
overhead. For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}`
thousands of times, the field names `user_id` and `timestamp` are written
out in full for each row, even though they're identical across rows.
Review Comment:
this is a good example, I think we could make a nice diagram that
illustrates the point nicely -- aka a bunch of JSON documents with repeated
fields showing that when encoded as JSON the strings get repeated over and over.
Then we can add a "here is how you would encode it as variant" showing both
the single copy of field names as well as the split into two different fields 🤔
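In the meantime, the savings can be sketched numerically (toy Python model of the metadata/value split; the real Variant layout is a binary format, and the names here are hypothetical):

```python
import json

# 1,000 rows repeating the same two field names.
rows = [{"user_id": i, "timestamp": "2025-04-21"} for i in range(1000)]

# Plain JSON: both field names are serialized in full for every row.
json_bytes = sum(len(json.dumps(r).encode()) for r in rows)

# Toy metadata/value split: each unique field name is stored once,
# and values refer to names by index into that dictionary.
metadata = sorted({k for r in rows for k in r})
name_to_id = {k: i for i, k in enumerate(metadata)}
values = [{name_to_id[k]: v for k, v in r.items()} for r in rows]

print(metadata)  # field names appear exactly once, not 1,000 times
```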
##########
_posts/2025-11-25-variant.md:
##########
+Variant avoids such overhead by splitting data across 2 columns: `metadata`
and `value`.
+
+The `metadata` column stores a dictionary of unique field names. The `value`
column stores the actual data. When encoding an object, field names aren't
written inline. Instead, each field references its name by offset position in
the metadata dictionary.
+
+_[Figure: diagram of a primitive variant and variant object being written to a
metadata and value column]_
+
+At the file level, this design enables a powerful optimization. You can build
a single metadata dictionary per column chunk, containing the union of all
unique field names across every row in that chunk. Rows with different schemas
all reference the same shared dictionary. This way, each unique field name
appears exactly once in the dictionary, even if it's used in thousands of rows.
+
+_[Figure: diagram of multiple variants pointing to the same metadata
dictionary]_
+
+### Variant guarantees faster search performance
+
+When objects are encoded, field entries must be written in lexicographic order
by field name. This ordering constraint enables efficient field lookups:
finding a value by field name takes O(log(n)) time by binary search, where n is
the number of fields in the object.
+
+Without this guarantee, you'd need to sequentially scan through every field.
For deeply nested objects with dozens of fields, this difference compounds
quickly.
+
+_maybe i should add some benchmarks here?_
+
+The metadata dictionary can similarly be optimized for faster search
performance by making the list of field names unique and sorted.
+
+### Variant can leverage file format capabilities
+
+Even with fast field lookups, Variant still requires full deserialization to
access any field, as the whole encoded value must be decoded just to read a
single field. This wastes CPU and memory on data that you don't need.
+
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
+
+This enables columnar file formats like Parquet to leverage their full
optimization toolset. Suppose you had to evaluate a query like:
+
+```sql
+SELECT user_id, start_timestamp FROM events
+WHERE start_timestamp > '2025-01-01' AND user_id = 12345;
+```
+
+Zone maps can skip entire row groups where `start_timestamp` falls outside the
query range. Dictionary encoding can compress repeated `user_id` values
efficiently. Bloom filters can rule out row groups without your target user.
More importantly, you only deserialize the shredded columns you need, not the
Variant columns.
+
+Shredding makes the trade-off explicit: extract frequently queried fields into
optimized columns at write time, and keep everything else flexible. You get
columnar performance for common access patterns and schema flexibility for
everything else.
Review Comment:
I think it would help to show how this would work in practice.
```sql
SELECT count(*) from events
WHERE event["start_timestamp"] > '2025-01-01' AND event["user_id"] = 12345;
```
To
```sql
SELECT count(*) from events
WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id =
12345;
```
you would then have to explain that predicate pushdown works for the typed
column
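Agreed -- and once the predicate targets the typed column, the pushdown itself is easy to sketch (toy Python, hypothetical row-group statistics): min/max zone maps rule out whole row groups before any variant bytes are touched.

```python
# Hypothetical per-row-group min/max statistics for the shredded
# start_timestamp column (toy model; real stats live in Parquet metadata).
row_groups = [
    {"min": "2024-01-01", "max": "2024-06-30", "rows": 10_000},
    {"min": "2024-07-01", "max": "2024-12-31", "rows": 10_000},
    {"min": "2025-01-02", "max": "2025-06-30", "rows": 10_000},
]

lower_bound = "2025-01-01"  # WHERE start_timestamp > '2025-01-01'

# Zone-map pruning: a group whose max can't exceed the bound can't
# match, so its rows are never decoded at all.
surviving = [g for g in row_groups if g["max"] > lower_bound]
print(f"{len(surviving)} of {len(row_groups)} row groups scanned")
```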
##########
_posts/2025-11-25-variant.md:
##########
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
Review Comment:
```suggestion
Variant solves this problem by standardizing a shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, it can be shredded into dedicated timestamp and
integer columns alongside the Variant columns.
```
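As a rough illustration of what the spec standardizes (toy Python, hypothetical field names; the actual shredded layout is defined at the file-format level): values that fit the declared type land in a typed column, and everything else falls back to the variant column.

```python
# Toy rows: user_id is usually a 64-bit int, but the data is messy.
rows = [
    {"user_id": 12345, "event": "click"},
    {"user_id": 67890, "event": "view"},
    {"user_id": "anonymous", "event": "view"},  # doesn't fit the typed column
]

typed_user_id = []  # shredded integer column (None = not shredded for this row)
fallback = []       # stand-in for the residual variant column

for row in rows:
    uid = row["user_id"]
    if isinstance(uid, int):
        typed_user_id.append(uid)
        fallback.append(None)
    else:
        typed_user_id.append(None)
        fallback.append(uid)

# A predicate like user_id = 12345 now scans only the typed column.
matches = sum(1 for uid in typed_user_id if uid == 12345)
print(typed_user_id, matches)
```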
##########
_posts/2025-11-25-variant.md:
##########
+This article explains the limitations of JSON, how Variant solves these
problems, and why you should be excited about it by analyzing performance
characteristics on a real world dataset.
Review Comment:
```suggestion
This article explains the limitations of JSON, introduces the Variant type
that solves these problems, and explains why you should be excited about it by
analyzing performance characteristics on a real world dataset.
```
##########
_posts/2025-11-25-variant.md:
##########
+### Variant guarantees faster search performance
+
+When objects are encoded, field entries must be written in lexicographic order
by field name. This ordering constraint enables efficient field lookups:
finding a value by field name takes O(log(n)) time by binary search, where n is
the number of fields in the object.
+
+Without this guarantee, you'd need to sequentially scan through every field.
For deeply nested objects with dozens of fields, this difference compounds
quickly.
+
+_maybe i should add some benchmarks here?_
Review Comment:
I think this is a nice detail but probably not worth belaboring -- this
section is an example of "efficient binary encoding" I think
##########
_posts/2025-11-25-variant.md:
##########
+## What's the problem with JSON?
+
+Many real-world datasets are messy and lack a rigid schema. As a result, it's
common to store such data as a JSON column.
+
+The access pattern of this model looks like:
+
+- **read**: for every row you scan, decode the entire JSON into memory and
evaluate it
+- **write**: for every record you write, encode it as its own JSON value
+
+There are many problems with this approach.
+
+_Read performance degrades quickly._ Evaluating any predicate requires
scanning every row and deserializing the entire JSON payload into memory, no
matter its size. Even multi-megabyte documents must be completely decoded just
to inspect a single field.
+
+_Write efficiency is also poor._ Because each row stores an independently
encoded JSON object, common object fields are redundantly serialized over and
over. This increases storage size and adds unnecessary encoding overhead.
+
+These problems aren't new, and solutions exist that improve the access patterns
of a JSON column. One common solution is to extract commonly occurring object
keys into dedicated columns, also known as object shredding. Yet this
requires substantial engineering effort, and implementation details vary by
query engine.
+
+## How Variant solves these problems
+
+Variant is a data type with an efficient binary encoding. It's designed to
store JSON data in a way that is more performant for OLAP query engines.
+
+### Variant has richer data types
+
+JSON is limited to just six data types: strings, numbers, booleans, nulls,
objects, and arrays. This simplicity comes at a cost, as specialized data types
like timestamps and UUIDs must be encoded as strings, then inferred and parsed
back at query time.
+
+Variant removes this overhead by supporting a much broader range of native
types organized into three categories: primitives, objects, and arrays. Objects
and arrays work exactly like their JSON counterparts, supporting arbitrary
nested structures. Variant extends the primitive category to include 20
specialized types. Dates, timestamps, UUIDs, binary data, and integers and
floats of various widths all have their own native representations. Values are
encoded in type-specific binary formats, optimized for their native type rather
than a stringified representation. This reduces both storage overhead and
query-time parsing costs.
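As a rough illustration (the microsecond-integer encoding here is invented for
the example, not the actual Variant byte layout), compare a timestamp carried
as a JSON string with the same instant packed as a fixed-width integer:

```python
import struct
from datetime import datetime, timezone

# Hypothetical comparison: the same instant as JSON text vs. a packed
# 8-byte little-endian integer of microseconds since the epoch. The
# real Variant encoding differs; this only shows the string-vs-binary
# gap in size and parsing work.
ts = datetime(2025, 4, 21, tzinfo=timezone.utc)

as_json = '"2025-04-21T00:00:00+00:00"'  # re-parsed on every read
as_micros = struct.pack("<q", int(ts.timestamp() * 1_000_000))  # fixed 8 bytes
```

The binary form is both smaller and directly usable without parsing at query
time.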
+
+The metadata dictionary can similarly be optimized for faster search
performance by making the list of field names unique and sorted.
+
+### Variant can leverage file format capabilities
Review Comment:
I know this is a general capability applicable to all (columnar) file
formats, but since this blog is about the Variant that was added to Parquet, I
recommend you focus this part on Parquet ("Variant can leverage Parquet's
columnar architecture" or something)
##########
_posts/2025-11-25-variant.md:
##########
+
+### Variant can leverage file format capabilities
+
+Even with fast field lookups, Variant still requires full deserialization to
access any field: the entire payload must be decoded just to read a single
value. This wastes CPU and memory on data you don't need.
+
+Variant solves this problem with its own shredding specification. The
specification defines how to extract frequently accessed fields from a Variant
value and store them as separate typed columns in the file format. For example,
if you frequently query a timestamp field called `start_timestamp` or a 64-bit
integer field called `user_id`, these can be shredded into dedicated timestamp
and integer columns alongside the Variant columns.
+
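Conceptually, shredding looks something like the sketch below (illustrative
only — the field and column names come from the example above, and the real
column layout is defined by the shredding specification):

```python
def shred(rows, field):
    """Pull one frequently queried field out into its own typed column.

    Returns the typed column (None where the field is absent) and the
    remainder of each row, which stays in the flexible variant payload.
    """
    typed_column = [row.get(field) for row in rows]
    remainder = [
        {k: v for k, v in row.items() if k != field} for row in rows
    ]
    return typed_column, remainder


rows = [
    {"user_id": 123, "start_timestamp": "2025-01-02", "extra": {"a": 1}},
    {"user_id": 456, "start_timestamp": "2024-12-31"},
]
user_ids, rest = shred(rows, "user_id")
```

The typed column can now be encoded, compressed, and filtered like any other
Parquet column, while the remainder keeps its schema flexibility.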
+This enables columnar file formats like Parquet to leverage their full
optimization toolset. Suppose you had to evaluate a query like:
+
+```sql
+SELECT user_id, start_timestamp FROM events
+WHERE start_timestamp > '2025-01-01' AND user_id = 12345;
+```
+
+Zone maps can skip entire row groups where `start_timestamp` falls outside the
query range. Dictionary encoding can compress repeated `user_id` values
efficiently. Bloom filters can rule out row groups without your target user.
More importantly, you only deserialize the shredded columns you need, not the
Variant columns.
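The zone-map pruning step can be sketched as follows (the row-group statistics
here are invented for the example):

```python
def can_skip(stats, cutoff):
    """Decide whether a whole row group can be skipped.

    stats: (min, max) values of start_timestamp for one row group.
    For the predicate start_timestamp > cutoff, a group whose maximum
    is at or below the cutoff cannot contain any matching row.
    ISO-8601 date strings compare correctly lexicographically.
    """
    _, max_ts = stats
    return max_ts <= cutoff


# Hypothetical per-row-group (min, max) statistics for start_timestamp.
row_groups = [("2024-01-01", "2024-06-30"), ("2024-12-01", "2025-03-31")]
skippable = [can_skip(stats, "2025-01-01") for stats in row_groups]
```

Only the second row group needs to be read at all; the first is pruned from
its statistics alone, without touching the data pages.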
+
+Shredding makes the trade-off explicit: extract frequently queried fields into
optimized columns at write time, and keep everything else flexible. You get
columnar performance for common access patterns and schema flexibility for
everything else.
+
+## Why you should be excited about Variant
+
+In this section, we'll explore the performance characteristics of JSON and
Variant data. We'll use ClickHouse's JSON Bench
[dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events),
a billion-record collection of Bluesky events designed to represent real-world
production data.
+
+Our benchmarks focus on three key metrics: write performance when serializing
data to Parquet files, storage efficiency comparing both compressed and
uncompressed file sizes, and query performance across common access patterns.
+
+For query execution, we use DataFusion, a popular query engine. To work with
Variant in DataFusion, we use
[datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant),
a library that implements native Variant type support.
+
Review Comment:
this looks great -- I can't wait for this section!
##########
_posts/2025-11-25-variant.md:
##########
Review Comment:
It would also be good to add a link to the actual shredding spec for people
to refer to if they want more details
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]