zeroshade commented on code in PR #47456:
URL: https://github.com/apache/arrow/pull/47456#discussion_r2319586882
##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -417,7 +421,591 @@ better zero-copy compatibility with various systems that also store booleans usi
   Metadata is an empty string.

-=========================
+.. _variant_extension:
+
+Parquet Variant
+===============
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is stored as a
+`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``Struct`` that obeys the following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)*
+
+    * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+* Extension type parameters:
+
+  This type does not have any parameters.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``.
+
+.. note::
+
+   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct<metadata: binary non-nullable, value: binary nullable>``

Review Comment:
Because the Parquet spec also explicitly states that the order isn't prescriptive and fields need to be accessed by name. In the interest of flexibility, given that the Parquet, Iceberg, and Spark implementations all state that order doesn't matter and that fields are accessed by name, I followed that lead. The other issue I can think of: if a user creates a new Parquet file from the resulting data after Arrow re-orders the columns, the Parquet schemas would no longer be equivalent/compatible.

And even if we prescribe the order of the metadata/value/typed_value fields, you can't fix the order of the shredded fields. Simply put, given that this is defined in terms of an extension type based on struct arrays, it would be hard to enforce ordering.
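
To illustrate what accessing fields by name rather than by position could look like, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the `(name, type)` list standing in for a struct type, the `BINARY_TYPES` set, and the helpers `field_by_name` and `looks_like_unshredded_variant` are all hypothetical and are not part of any Arrow or Parquet API. Nullability and dictionary-encoded `metadata` are deliberately not modeled.

```python
# Hypothetical, simplified model of a struct storage type: an ordered list of
# (field_name, type_name) pairs. This is NOT the Arrow API, only an illustration.
BINARY_TYPES = {"binary", "large_binary", "binary_view"}


def field_by_name(fields, name):
    """Return the type of the field called `name`, or None if it is absent.

    Lookup is by exact, case-sensitive name and never by position.
    """
    matches = [ftype for fname, ftype in fields if fname == name]
    if len(matches) > 1:
        raise ValueError(f"duplicate field name: {name!r}")
    return matches[0] if matches else None


def looks_like_unshredded_variant(fields):
    """Check the top-level shape of an unshredded variant storage struct."""
    return (
        len(fields) == 2
        and field_by_name(fields, "metadata") in BINARY_TYPES
        and field_by_name(fields, "value") in BINARY_TYPES
    )


# The same two fields in two different orders.
a = [("metadata", "binary"), ("value", "binary")]
b = [("value", "binary"), ("metadata", "binary")]

print(looks_like_unshredded_variant(a))  # name-based check on ordering a
print(looks_like_unshredded_variant(b))  # name-based check on ordering b
print(a[0][0] == b[0][0])                # positional comparison of the first field
```

With name-based lookup both orderings pass the same check, whereas a consumer that compared fields positionally would see the two layouts as different. That is the kind of breakage a prescribed order would be trying to prevent, and it goes away when fields are resolved by name.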