lidavidm commented on code in PR #47456:
URL: https://github.com/apache/arrow/pull/47456#discussion_r2308896395


##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:

Review Comment:
   ```suggestion
   * The storage type of this extension is a ``Struct`` that obeys the 
following rules:
   ```
   
   



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``

Review Comment:
   ```suggestion
       * A field named ``typed_value`` which can be any :term:`primitive type` 
or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.
+
+.. note::
+
+  The fields may be in any order, and thus must be accessed by **name** not by 
*position*.

Review Comment:
   ```suggestion
   .. note::
   
      The fields may be in any order, and thus must be accessed by **name** not 
by *position*.
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.
+
+.. note::
+
+  The fields may be in any order, and thus must be accessed by **name** not by 
*position*.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two 
fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary required, value: binary required>``
+* ``struct<value: binary optional, metadata: binary required>``
+* ``struct<metadata: dictionary<int8, binary> required, value: binary_view 
required>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose a Variant field named *measurement* and we want to shred the ``int64`` 
values into a separate column for efficiency.

Review Comment:
   ```suggestion
   Suppose we have a Variant field named *measurement* and we want to shred the 
``int64`` values into a separate column for efficiency.
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.
+
+.. note::
+
+  The fields may be in any order, and thus must be accessed by **name** not by 
*position*.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two 
fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary required, value: binary required>``
+* ``struct<value: binary optional, metadata: binary required>``
+* ``struct<metadata: dictionary<int8, binary> required, value: binary_view 
required>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose a Variant field named *measurement* and we want to shred the ``int64`` 
values into a separate column for efficiency.
+In Parquet, this could be represented as::
+
+  required group measurement (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional int64 typed_value;
+  }
+
+Thus the corresponding storage type for the ``parquet.variant`` Arrow 
extension type would be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: int64 optional
+  >
+
+If we suppose a series of measurements consisting of: ::
+
+  34, null, "n/a", 100
+
+The data should be stored/represented in Arrow as: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63    |
+    |--------------------------|---------------|
+    | 00001011                 | 0 (padding)   |
+
+  * Children arrays:
+    * field-0 array (`VarBinary`)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8    | unspecified (padding)    |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7          | Bytes 8-63               |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00  | unspecified (padding)    |
+
+    * field-1 array (`VarBinary`)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00000110                 | 0 (padding)   |
+
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 0, 1, 5, 5    | unspecified (padding)    |
+
+      * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant 
encoding literal string "n/a")
+
+        | Bytes 0-4              | Bytes 5-63               |
+        |------------------------|--------------------------|
+        | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding)    |
+
+    * field-2 array (int64 array)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00001001                 | 0 (padding)   |
+
+      * Value buffer:
+
+        | Bytes 0-31          | Bytes 32-63               |
+        |---------------------|--------------------------|
+        | 34, 00, 00, 100     | unspecified (padding)    |
+
+.. note::
+
+  Notice that there is a variant ``literal null`` in the ``value`` array, this 
is due to the
+  `shredding specification 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__
+  so that a consumer can tell the difference between a *missing* field and a 
**null** field.
+
+Shredding an Array
+''''''''''''''''''
+
+For our next example, we will represent a shredded array of strings. Let's 
consider a column that looks like: ::
+
+  ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
+
+Representing this shredded variant in Parquet could look like: ::
+
+  optional group tags (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional group typed_value (LIST) { # optional to allow null lists
+      repeated group list {
+        required group element {        # shredded element
+          optional binary value;
+          optional binary typed_value (STRING);
+        }
+      }
+    }
+  }
+
+The array structure for Variant encoding does not allow missing elements, so 
all elements of the array must
+be *non-nullable*. As such, either **typed_value** or **value** (*but not 
both!*) must be *non-null*. A null
+element must be encoded as a Variant null: *basic type* ``0`` (primitive) and 
*physical type* ``0`` (null).
+
+The storage type to represent this in Arrow as a Variant extension type would 
be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: list<element: struct<
+      value: binary optional,
+      typed_value: string optional
+    > required> optional
+  >
+
+.. note::
+
+  As usual, **Binary** could also be **LargeBinary** or **BinaryView**, 
**String** could also be **LargeString** or **StringView**,
+  and **List** could also be **LargeList** or **ListView**.
+
+The data would then be stored in Arrow as follows: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63    |
+    |--------------------------|---------------|
+    | 00000111                 | 0 (padding)   |
+
+  * Children arrays:
+    * field-0 array (`VarBinary` Metadata)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8    | unspecified (padding)    |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7          | Bytes 8-63               |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00  | unspecified (padding)    |
+
+    * field-1 array (`VarBinary` Value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00001000                 | 0 (padding)   |
+
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 0, 0, 0, 1    | unspecified (padding)    |
+
+      * Value buffer: (00 -> variant null)
+
+        | Bytes 0            | Bytes 1-63               |
+        |--------------------|--------------------------|
+        | 00                 | unspecified (padding)    |
+
+    * field-2 array (`List<Struct<VarBinary, String>>` typed_value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63  |
+        |--------------------------|-------------|
+        | 00000111                 | 0 (padding) |
+
+      * Offsets buffer (int32)
+
+        | Bytes 0-19        | Bytes 20-63           |
+        |-------------------|-----------------------|
+        | 0, 2, 4, 7, 7     | unspecified (padding) |
+
+      * Values array (`Struct<VarBinary, String>` element):
+        * Length: 7, Null count: 0
+        * Validity bitmap buffer: Not required
+
+        * Children arrays:
+          * field-0 array (`VarBinary` value)
+            * Length: 7, Null count: 6
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63  |
+              |--------------------------|-------------|
+              | 00001000                 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31                | Bytes 32-63              |
+              |---------------------------|--------------------------|
+              | 0, 0, 0, 0, 1, 1, 1, 1    | unspecified (padding)    |
+
+            * Values buffer (`00` -> variant null):
+
+              | Bytes 0            | Bytes 1-63               |
+              |--------------------|--------------------------|
+              | 00                 | unspecified (padding)    |
+
+          * field-1 array (`String` typed_value)
+            * Length: 7, Null count: 1
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63  |
+              |--------------------------|-------------|
+              | 01110111                 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31                      | Bytes 32-63              |
+              |---------------------------------|--------------------------|
+              | 0, 6, 11, 17, 17, 23, 28, 35    | unspecified (padding)    |
+
+            * Values buffer:
+
+              | Bytes 0-35                           | Bytes 36-63             
 |
+              
|--------------------------------------|--------------------------|
+              | comedydramahorrorcomedydramaromance  | unspecified (padding)   
 |
+
+Shredding an Object
+'''''''''''''''''''
+
+Let's consider a simple JSON column of "events" which contain a field named 
``event_type`` (a string)
+and a field named ``event_ts`` (a timestamp) that we wish to shred into 
separate columns, In Parquet,
+it could look something like this: ::
+
+  optional group event (VARIANT) {
+    required binary metadata;
+    optional binary value;        # variant, remaining fields/values
+    optional group typed_value {  # shredded fields for variant object
+      required group event_type { # event_type shredded field
+        optional binary value;
+        optional binary typed_value (STRING);
+      }
+      required group event_ts {   # event_ts shredded field
+        optional binary value;
+        optional int64 typed_value (TIMESTAMP(true, MICROS))
+      }
+    }
+  }
+
+We can then, fairly easily, translate this into the expected extension storage 
type: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: struct<
+      event_type: struct<
+        value: binary optional,
+        typed_value: string optional
+      > required,
+      event_ts: struct<
+        value: binary optional,
+        typed_value: timestamp(us, UTC) optional
+      > required
+    > optional
+  >
+
+If a field *does not exist* in the variant object value, then both the 
**value** and **typed_value** columns for that row
+will be null. If a field is *present*, but the value is null, then **value** 
must contain a Variant null: *basic type*
+``0`` (primitive) and *physical type* ``0`` (null).
+
+It is *invalid* for both **value** and **typed_value** to be non-null for a 
given index. A reader can choose not to error
+in this scenario, but if so it **must** use the value in the **typed_value** 
column for that index.
+
+Let's consider the following series of objects: ::
+
+  {"event_type": "noop", "event_ts": 1729794114937}
+
+  {"event_type": "login", "event_ts": 1729794146402, "email": 
"u...@example.com"}
+
+  {"error_msg": "malformed..."}
+
+  "malformed: not an object"
+
+  {"event_ts": 1729794240241, "click": "_button"}
+
+  {"event_ts": null, "event_ts": 1729794954163}
+
+  {"event_type": "noop", "event_ts": "2024-10-24"}
+
+  {}
+
+  null
+
+  *Entirely missing*
+
+To represent those values as a column of Variant values using the Variant 
extension type we get the following: ::
+
+  * Length: 10, Null count: 1
+  * Validity bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
+    |--------------------------|-----------|-----------------------|
+    | 11111111                 | 00000001  | 0 (padding)           |
+
+  * Children arrays
+    * field-0 array (`VarBinary` Metadata)
+      * Length: 10, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-43 (int32)                       | Bytes 44-63             |
+        |------------------------------------------|-------------------------|
+        | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding)   |
+
+      * Value buffer: (01 00 -> version 1 empty metadata,
+                       01 01 00 XX ... -> Version 1, metadata with 1 elem, 
offset 0, offset XX == len(string), ... is dict string bytes)
+
+        | Bytes 0-1 | Bytes 2-10        | Bytes 11-23           | Bytes 24-25 
| Bytes 26-34       |
+        
|-------------------------------|-----------------------|-------------|-------------------|
+        | 01 00     | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00       
| 01 01 00 05 click |
+
+        | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 
| Bytes 45-63           |
+        
|-------------|-------------|-------------|-------------|-------------|-----------------------|
+        | 01 00       | 01 00       | 01 00       | 01 00       | 01 00       
| unspecified (padding) |
+
+    * field-1 array (`VarBinary` Value)
+      * Length: 10, Null count: 5
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap)  | Byte 1    | Bytes 2-63            |
+        |---------------------------|-----------|-----------------------|
+        | 00011110                  | 00000001  | 0 (padding)           |
+
+      * Offsets buffer (filled in based on lengths of encoded variants):
+
+        | ... |
+
+      * Value buffer:
+
+        | VariantEncode({"email": "u...@email.com"}) | 
VariantEncode({"error_msg": "malformed..."}) |
+        | VariantEncode("malformed: not an object")  | VariantEncode({"click": 
"_button"})          | 00 (null) |
+
+    * field-2 array (`Struct<...>` typed_value)
+      * Length: 10, Null count: 3
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
+        |--------------------------|-----------|-----------------------|
+        | 11110111                 | 00000000  | 0 (padding)           |
+
+      * Children arrays:
+        * field-0 array (`Struct<VarBinary, String>` event_type)
+          * Length: 10, Null count: 0
+          * Validity bitmap buffer: not required
+
+          * Children arrays
+            * field-0 array (`VarBinary` value)
+              * Length: 10, Null count: 9
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            
|
+                
|--------------------------|-----------|-----------------------|
+                | 01000000                 | 00000000  | 0 (padding)           
|
+
+              * Offsets buffer (int32)
+
+                | Bytes 0-43 (int32)              | Bytes 44-63             |
+                |---------------------------------|-------------------------|
+                | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding)   |
+
+              * Value buffer:
+
+                | Byte 0 | Bytes 1-63             |
+                |--------|------------------------|
+                | 00     | unspecified (padding)  |
+
+            * field-1 array (`String` typed_value)
+              * Length: 10, Null count: 7
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            
|
+                
|--------------------------|-----------|-----------------------|
+                | 01000011                 | 00000000  | 0 (padding)           
|
+
+              * Offsets buffer (int32)
+
+                | Byte 0-43                           | Bytes 44-63            
|
+                
|-------------------------------------|------------------------|
+                | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding)  
|
+
+              * Value buffer:
+
+                | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63            |
+                |-----------|-----------|------------|------------------------|
+                | noop      | login     | noop       | unspecified (padding)  |
+
+
+        * field-1 array (`Struct<VarBinary, Timestamp>` event_ts)
+          * Length: 10, Null count: 0
+          * Validity bitmap buffer: not required
+
+          * Children arrays
+            * field-0 array (`VarBinary` value)
+              * Length: 10, Null count: 9
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            
|
+                
|--------------------------|-----------|-----------------------|
+                | 01000000                 | 00000000  | 0 (padding)           
|
+
+              * Offsets buffer (int32)
+
+                | Bytes 0-43 (int32)              | Bytes 44-63             |
+                |---------------------------------|-------------------------|
+                | ...                             | unspecified (padding)   |
+
+              * Value buffer:
+
+                | VariantEncode("2024-10-24")     |
+
+            * field-1 array (`Timestamp(us, UTC)` typed_value)
+              * Length: 10, Null count: 6
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            
|
+                
|--------------------------|-----------|-----------------------|
+                | 00110011                 | 00000000  | 0 (padding)           
|
+
+              * Value buffer:
+
+                | Bytes 0-7     | Bytes 8-15    | Bytes 16-31  | Bytes 32-39   
| Bytes 40-47   | Bytes 48-63            |
+                
|---------------|---------------|--------------|---------------|---------------|------------------------|
+                | 1729794114937 | 1729794146402 | unspecified  | 1729794240241 
| 1729794954163 | unspecified (padding)  |
+
+
+Putting it all together
+'''''''''''''''''''''''
+
+As mentioned, the **typed_value** field associated with a Variant **value** 
can be of any shredded type. As a result,
+as long as we follow the original rules you can have an arbitrary number of 
nested levels based on how you want to

Review Comment:
   ```suggestion
   as long as we follow the original rules we can have an arbitrary number of 
nested levels based on how you want to
   ```
   nit but be consistent about 1st/2nd person?



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of

Review Comment:
   ```suggestion
         * If the ``typed_value`` field is a *nested* type, its elements 
**must** be *non-nullable* and **must** be a ``Struct`` consisting of
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======

Review Comment:
   nit: should we be clear about labeling this "Parquet Variant"?



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.
+
+.. note::
+
+  The fields may be in any order, and thus must be accessed by **name** not by 
*position*.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two 
fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary required, value: binary required>``
+* ``struct<value: binary optional, metadata: binary required>``
+* ``struct<metadata: dictionary<int8, binary> required, value: binary_view 
required>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose a Variant field named *measurement* and we want to shred the ``int64`` 
values into a separate column for efficiency.
+In Parquet, this could be represented as::
+
+  required group measurement (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional int64 typed_value;
+  }
+
+Thus the corresponding storage type for the ``parquet.variant`` Arrow 
extension type would be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: int64 optional
+  >
+
+If we suppose a series of measurements consisting of: ::
+
+  34, null, "n/a", 100
+
+The data should be stored/represented in Arrow as: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63    |
+    |--------------------------|---------------|
+    | 00001011                 | 0 (padding)   |
+
+  * Children arrays:
+    * field-0 array (`VarBinary`)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8    | unspecified (padding)    |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7          | Bytes 8-63               |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00  | unspecified (padding)    |
+
+    * field-1 array (`VarBinary`)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00000110                 | 0 (padding)   |
+
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 0, 1, 5, 5    | unspecified (padding)    |
+
+      * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant 
encoding literal string "n/a")
+
+        | Bytes 0-4              | Bytes 5-63               |
+        |------------------------|--------------------------|
+        | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding)    |
+
+    * field-2 array (int64 array)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00001001                 | 0 (padding)   |
+
+      * Value buffer:
+
+        | Bytes 0-31          | Bytes 32-63               |
+        |---------------------|--------------------------|
+        | 34, 00, 00, 100     | unspecified (padding)    |
+
+.. note::
+
+  Notice that there is a variant ``literal null`` in the ``value`` array, this 
is due to the
+  `shredding specification 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__
+  so that a consumer can tell the difference between a *missing* field and a 
**null** field.

Review Comment:
   We implicitly describe nulls here and then explicitly describe it below; 
would it make sense to explicitly introduce nulls first (and link to the docs) 
and then omit both of the other explanations?



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.
+
+.. note::
+
+  The fields may be in any order, and thus must be accessed by **name** not by 
*position*.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two 
fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary required, value: binary required>``
+* ``struct<value: binary optional, metadata: binary required>``
+* ``struct<metadata: dictionary<int8, binary> required, value: binary_view 
required>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose a Variant field named *measurement* and we want to shred the ``int64`` 
values into a separate column for efficiency.
+In Parquet, this could be represented as::
+
+  required group measurement (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional int64 typed_value;
+  }
+
+Thus the corresponding storage type for the ``parquet.variant`` Arrow 
extension type would be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: int64 optional
+  >
+
+If we suppose a series of measurements consisting of: ::
+
+  34, null, "n/a", 100
+
+The data should be stored/represented in Arrow as: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63    |
+    |--------------------------|---------------|
+    | 00001011                 | 0 (padding)   |
+
+  * Children arrays:
+    * field-0 array (`VarBinary`)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8    | unspecified (padding)    |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7          | Bytes 8-63               |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00  | unspecified (padding)    |
+
+    * field-1 array (`VarBinary`)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00000110                 | 0 (padding)   |
+
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 0, 1, 5, 5    | unspecified (padding)    |
+
+      * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant 
encoding literal string "n/a")
+
+        | Bytes 0-4              | Bytes 5-63               |
+        |------------------------|--------------------------|
+        | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding)    |
+
+    * field-2 array (int64 array)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00001001                 | 0 (padding)   |
+
+      * Value buffer:
+
+        | Bytes 0-31          | Bytes 32-63               |
+        |---------------------|--------------------------|
+        | 34, 00, 00, 100     | unspecified (padding)    |
+
+.. note::
+
+  Notice that there is a variant ``literal null`` in the ``value`` array, this 
is due to the
+  `shredding specification 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__
+  so that a consumer can tell the difference between a *missing* field and a 
**null** field.
+
+Shredding an Array
+''''''''''''''''''
+
+For our next example, we will represent a shredded array of strings. Let's 
consider a column that looks like: ::
+
+  ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
+
+Representing this shredded variant in Parquet could look like: ::
+
+  optional group tags (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional group typed_value (LIST) { # optional to allow null lists
+      repeated group list {
+        required group element {        # shredded element
+          optional binary value;
+          optional binary typed_value (STRING);
+        }
+      }
+    }
+  }
+
+The array structure for Variant encoding does not allow missing elements, so 
all elements of the array must
+be *non-nullable*. As such, either **typed_value** or **value** (*but not 
both!*) must be *non-null*. A null
+element must be encoded as a Variant null: *basic type* ``0`` (primitive) and 
*physical type* ``0`` (null).
+
+The storage type to represent this in Arrow as a Variant extension type would 
be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: list<element: struct<
+      value: binary optional,
+      typed_value: string optional
+    > required> optional
+  >
+
+.. note::
+
+  As usual, **Binary** could also be **LargeBinary** or **BinaryView**, 
**String** could also be **LargeString** or **StringView**,
+  and **List** could also be **LargeList** or **ListView**.
+
+The data would then be stored in Arrow as follows: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63    |
+    |--------------------------|---------------|
+    | 00000111                 | 0 (padding)   |
+
+  * Children arrays:
+    * field-0 array (`VarBinary` Metadata)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8    | unspecified (padding)    |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7          | Bytes 8-63               |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00  | unspecified (padding)    |
+
+    * field-1 array (`VarBinary` Value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63    |
+        |--------------------------|---------------|
+        | 00001000                 | 0 (padding)   |
+
+      * Offsets buffer:
+
+        | Bytes 0-19       | Bytes 20-63              |
+        |------------------|--------------------------|
+        | 0, 0, 0, 0, 1    | unspecified (padding)    |
+
+      * Value buffer: (00 -> variant null)
+
+        | Bytes 0            | Bytes 1-63               |
+        |--------------------|--------------------------|
+        | 00                 | unspecified (padding)    |
+
+    * field-2 array (`List<Struct<VarBinary, String>>` typed_value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63  |
+        |--------------------------|-------------|
+        | 00000111                 | 0 (padding) |
+
+      * Offsets buffer (int32)
+
+        | Bytes 0-19        | Bytes 20-63           |
+        |-------------------|-----------------------|
+        | 0, 2, 4, 7, 7     | unspecified (padding) |
+
+      * Values array (`Struct<VarBinary, String>` element):
+        * Length: 7, Null count: 0
+        * Validity bitmap buffer: Not required
+
+        * Children arrays:
+          * field-0 array (`VarBinary` value)
+            * Length: 7, Null count: 6
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63  |
+              |--------------------------|-------------|
+              | 00001000                 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31                | Bytes 32-63              |
+              |---------------------------|--------------------------|
+              | 0, 0, 0, 0, 1, 1, 1, 1    | unspecified (padding)    |
+
+            * Values buffer (`00` -> variant null):
+
+              | Bytes 0            | Bytes 1-63               |
+              |--------------------|--------------------------|
+              | 00                 | unspecified (padding)    |
+
+          * field-1 array (`String` typed_value)
+            * Length: 7, Null count: 1
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63  |
+              |--------------------------|-------------|
+              | 01110111                 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31                      | Bytes 32-63              |
+              |---------------------------------|--------------------------|
+              | 0, 6, 11, 17, 17, 23, 28, 35    | unspecified (padding)    |
+
+            * Values buffer:
+
+              | Bytes 0-35                           | Bytes 36-63             
 |
+              
|--------------------------------------|--------------------------|
+              | comedydramahorrorcomedydramaromance  | unspecified (padding)   
 |
+
+Shredding an Object
+'''''''''''''''''''
+
+Let's consider a simple JSON column of "events" which contain a field named 
``event_type`` (a string)
+and a field named ``event_ts`` (a timestamp) that we wish to shred into 
separate columns, In Parquet,
+it could look something like this: ::

Review Comment:
   ```suggestion
   it could look something like this::
   ```
   
   A double-colon [(1) leaves behind a colon (2) starts a code block, so no 
need to do 
both](https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#literal-blocks)



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,585 @@ better zero-copy compatibility with various systems that 
also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is 
stored as a
+`Parquet Variant 
<https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ 
value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded 
<https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data 
around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact 
with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary 
values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the 
following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or 
``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` 
fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a 
``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** 
be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, 
``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the 
``typed_value`` field.
+
+.. note::
+
+  It is also *permissible* for the ``metadata`` field to be dictionary-encoded 
with an index type of ``int8``.

Review Comment:
   ```suggestion
   .. note::
   
      It is also *permissible* for the ``metadata`` field to be 
dictionary-encoded with an index type of ``int8``.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to