zeroshade commented on code in PR #47456:
URL: https://github.com/apache/arrow/pull/47456#discussion_r2331124199


##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,133 @@ better zero-copy compatibility with various systems that also store booleans usi
 
   Metadata is an empty string.
 
-=========================
+.. _parquet_variant_extension:
+
+Parquet Variant
+===============
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is stored as a
+`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary values should look like.
+
+* Extension name: ``arrow.parquet.variant``.
+
+* The storage type of this extension is a ``Struct`` that obeys the following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+      (unshredded variants consist of just the ``metadata`` and ``value`` fields only)
+
+    * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
+        be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+      * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded
+        from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+* Extension type parameters:
+
+  This type does not have any parameters.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
+   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.
+
+.. note::
+
+   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.
+
+.. _variant_primitive_type_mapping:
+
+Primitive Type Mappings
+-----------------------
+
++----------------------+------------------------+
+| Arrow Primitive Type | Variant Primitive Type |
++======================+========================+
+| Null                 | Null                   |
++----------------------+------------------------+
+| Boolean (true/false) | Boolean                |

Review Comment:
   good point, my mistake there. I'll swap these
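
   For readers following the storage-type rules in the hunk above, here is a
   minimal sketch of how those rules could be checked. It is not the Arrow
   API: ``Field``, ``Struct``, ``ListType`` and ``is_valid_variant_storage``
   are hypothetical stand-ins written with the standard library only. It
   assumes that any non-nested type name passed as ``typed_value`` is an
   allowed primitive (the mapping table is not reproduced here), and it
   ignores the dictionary-encoded and run-end-encoded ``metadata`` variants
   that the note permits.

```python
from dataclasses import dataclass
from typing import List, Union

BINARY_TYPES = {"Binary", "LargeBinary", "BinaryView"}
LIST_KINDS = {"List", "LargeList", "ListView"}


@dataclass
class Field:
    """Hypothetical stand-in for a named, possibly nullable field."""
    name: str
    type: "DataType"
    nullable: bool = True


@dataclass
class Struct:
    fields: List[Field]


@dataclass
class ListType:
    kind: str       # one of LIST_KINDS
    element: Field  # the list's element field


DataType = Union[str, Struct, ListType]


def _is_binary(t) -> bool:
    return isinstance(t, str) and t in BINARY_TYPES


def _check_value_group(struct: Struct) -> bool:
    """At least one of ``value`` / ``typed_value``, each well-typed."""
    # Fields are looked up by name (case sensitive), never by position.
    by_name = {f.name: f for f in struct.fields}
    value = by_name.get("value")
    typed = by_name.get("typed_value")
    if value is None and typed is None:
        return False
    if value is not None and not _is_binary(value.type):
        return False
    if typed is not None and not _check_typed_value(typed.type):
        return False
    return True


def _is_shredded_group(field: Field) -> bool:
    """Non-nullable struct that itself holds value and/or typed_value."""
    return (
        not field.nullable
        and isinstance(field.type, Struct)
        and _check_value_group(field.type)
    )


def _check_typed_value(t) -> bool:
    if isinstance(t, ListType):
        return t.kind in LIST_KINDS and _is_shredded_group(t.element)
    if isinstance(t, Struct):
        return all(_is_shredded_group(f) for f in t.fields)
    # Assumption: any other type name is a permitted primitive type.
    return isinstance(t, str)


def is_valid_variant_storage(storage) -> bool:
    """Check a storage type against the rules quoted in the diff."""
    if not isinstance(storage, Struct):
        return False
    by_name = {f.name: f for f in storage.fields}
    metadata = by_name.get("metadata")
    if metadata is None or metadata.nullable or not _is_binary(metadata.type):
        return False
    return _check_value_group(storage)
```

   Because fields are collected into a name-keyed mapping, a struct that
   lists ``value`` before ``metadata`` is accepted, which matches the note
   about field order.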



