amoeba commented on code in PR #47456: URL: https://github.com/apache/arrow/pull/47456#discussion_r2319541766
########## docs/source/format/CanonicalExtensions.rst: ########## @@ -417,7 +421,591 @@ better zero-copy compatibility with various systems that also store booleans usi Metadata is an empty string. -========================= +.. _variant_extension: + +Parquet Variant +=============== + +Variant represents a value that may be one of: Review Comment: I'm late to the review party here but do we want to make sure we always refer to this specific Variant as "Parquet Variant" so we don't confuse it with a potential Arrow Variant? ########## docs/source/format/CanonicalExtensions.rst: ########## @@ -45,7 +45,11 @@ types: * The specification text to be added *must* follow these requirements: - 1) It *must* define a well-defined extension name starting with "``arrow.``". + 1) It *must* define a well-defined extension name starting with an allowed prefix. + The currently allowed prefixes are: + * "``arrow.``" - For general-purpose canonical extension types. + * "``parquet.``" - For canonical extension types that are intended primarily for + interoperability with `Apache Parquet <https://parquet.apache.org/>`__ format. Review Comment: ```suggestion interoperability with the `Apache Parquet <https://parquet.apache.org/>`__ format. ``` ########## docs/source/format/CanonicalExtensions.rst: ########## @@ -417,7 +421,591 @@ better zero-copy compatibility with various systems that also store booleans usi Metadata is an empty string. -========================= +.. _parquet_variant_extension: + +Parquet Variant +=============== + +Variant represents a value that may be one of: + +* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``) + +* Array: An ordered list of Variant values + +* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). 
An object may not contain duplicate keys + +Particularly, this provides a way to represent semi-structured data which is stored as a +`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in +a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__ +variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version +or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links +to the Parquet format specification for details on what the actual binary values should look like. + +* Extension name: ``parquet.variant``. + +* The storage type of this extension is a ``Struct`` that obeys the following rules: + + * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + + * At least one (or both) of the following: + + * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)* + + * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct`` + + * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of + at least one (or both) of the following: + + * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + + * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data). +* Extension type parameters: + + This type does not have any parameters. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. 
note:: + + It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``. + +.. note:: + + The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive. + +Examples +-------- + +Unshredded +'''''''''' + +The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of +the following storage types are valid (not an exhaustive list): + +* ``struct<metadata: binary non-nullable, value: binary nullable>`` +* ``struct<value: binary nullable, metadata: binary non-nullable>`` +* ``struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view nullable>`` + +Simple Shredding +'''''''''''''''' + +Suppose we have a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency. +In Parquet, this could be represented as:: + + required group measurement (VARIANT) { + required binary metadata; + optional binary value; + optional int64 typed_value; + } + +Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be:: + + struct< + metadata: binary required, + value: binary optional, + typed_value: int64 optional + > + +If we suppose a series of measurements consisting of:: + + 34, null, "n/a", 100 + +The data should be stored/represented in Arrow as:: + + * Length: 4, Null count: 1 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001011 | 0 (padding) | + + * Children arrays: + * field-0 array (`VarBinary`) + * Length: 4, Null count: 0 + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 2, 4, 6, 8 | unspecified (padding) | + + * Value buffer: (01 00 -> indicates version 1 empty metadata) + + | Bytes 0-7 | Bytes 8-63 | + 
|-------------------------|--------------------------| + | 01 00 01 00 01 00 01 00 | unspecified (padding) | + + * field-1 array (`VarBinary`) + * Length: 4, Null count: 2 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00000110 | 0 (padding) | + + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 0, 1, 5, 5 | unspecified (padding) | + + * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a") Review Comment: This gets cut off in the rendered HTML, can we hard-wrap this or something? ########## docs/source/format/CanonicalExtensions.rst: ########## @@ -417,7 +421,592 @@ better zero-copy compatibility with various systems that also store booleans usi Metadata is an empty string. -========================= +.. _parquet_variant_extension: + +Parquet Variant +=============== + +Variant represents a value that may be one of: + +* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``) + +* Array: An ordered list of Variant values + +* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys + +Particularly, this provides a way to represent semi-structured data which is stored as a +`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in +a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__ +variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version +or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links +to the Parquet format specification for details on what the actual binary values should look like. 
+ +* Extension name: ``parquet.variant``. + +* The storage type of this extension is a ``Struct`` that obeys the following rules: + + * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + + * At least one (or both) of the following: + + * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)* + + * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct`` + + * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of + at least one (or both) of the following: + + * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. + + * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data). + +* Extension type parameters: + + This type does not have any parameters. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``. + +.. note:: + + The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive. + +Examples +-------- + +Unshredded +'''''''''' + +The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. 
Any of +the following storage types are valid (not an exhaustive list): + +* ``struct<metadata: binary non-nullable, value: binary nullable>`` +* ``struct<value: binary nullable, metadata: binary non-nullable>`` +* ``struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view nullable>`` + +Simple Shredding +'''''''''''''''' + +Suppose we have a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency. +In Parquet, this could be represented as:: + + required group measurement (VARIANT) { + required binary metadata; + optional binary value; + optional int64 typed_value; + } + +Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: int64 nullable + > + +If we suppose a series of measurements consisting of:: + + 34, null, "n/a", 100 + +The data should be stored/represented in Arrow as:: + + * Length: 4, Null count: 1 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001011 | 0 (padding) | + + * Children arrays: + * field-0 array (`VarBinary`) + * Length: 4, Null count: 0 + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 2, 4, 6, 8 | unspecified (padding) | + + * Value buffer: (01 00 -> indicates version 1 empty metadata) + + | Bytes 0-7 | Bytes 8-63 | + |-------------------------|--------------------------| + | 01 00 01 00 01 00 01 00 | unspecified (padding) | + + * field-1 array (`VarBinary`) + * Length: 4, Null count: 2 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00000110 | 0 (padding) | + + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 0, 1, 5, 5 | unspecified (padding) | + + * Value buffer: 
(`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a") + + | Bytes 0-4 | Bytes 5-63 | + |------------------------|--------------------------| + | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding) | + + * field-2 array (int64 array) + * Length: 4, Null count: 2 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001001 | 0 (padding) | + + * Value buffer: + + | Bytes 0-31 | Bytes 32-63 | + |---------------------|--------------------------| + | 34, 00, 00, 100 | unspecified (padding) | + +.. note:: + + Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the + `shredding specification <https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__ + so that a consumer can tell the difference between a *missing* field and a **null** field. A null + element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null). + +Shredding an Array +'''''''''''''''''' + +For our next example, we will represent a shredded array of strings. Let's consider a column that looks like: :: + + ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null + +Representing this shredded variant in Parquet could look like:: + + optional group tags (VARIANT) { + required binary metadata; + optional binary value; + optional group typed_value (LIST) { # optional to allow null lists + repeated group list { + required group element { # shredded element + optional binary value; + optional binary typed_value (STRING); + } + } + } + } + +The array structure for Variant encoding does not allow missing elements, so all elements of the array must +be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*. 
+ +The storage type to represent this in Arrow as a Variant extension type would be:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: list<element: struct< + value: binary nullable, + typed_value: string nullable + > non-nullable> nullable + > + +.. note:: + Review Comment: Why bold? ########## docs/source/format/CanonicalExtensions.rst: ########## @@ -417,7 +421,591 @@ better zero-copy compatibility with various systems that also store booleans usi Metadata is an empty string. -========================= +.. _variant_extension: + +Parquet Variant +=============== + +Variant represents a value that may be one of: Review Comment: (The suggestion here would be to replace standalone "Variant" with "Parquet Variant".)
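
As a reading aid for the "Simple Shredding" layout quoted in the diff, here is a minimal sketch in plain Python (no Arrow dependency) that decodes the two child validity bitmaps and the `value` offsets exactly as they appear in the quoted tables. The helper names `decode_validity` and `slot_lengths` are hypothetical and exist only for this illustration; the sketch assumes Arrow's least-significant-bit numbering for validity bitmaps and that the tables print each byte most-significant bit first.

```python
def decode_validity(bits, length):
    """Decode one validity byte, written MSB-first as in the quoted tables."""
    byte = int(bits, 2)
    # Assumption: Arrow LSB bit numbering, so slot i is bit i of the byte.
    return [bool((byte >> i) & 1) for i in range(length)]


def slot_lengths(offsets):
    """Byte length of each variable-size binary slot, derived from its offsets."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


# Values copied from the quoted diff.
value_valid = decode_validity("00000110", 4)    # field-1 array (`value`)
typed_valid = decode_validity("00001001", 4)    # field-2 array (`typed_value`)
value_lengths = slot_lengths([0, 0, 1, 5, 5])   # field-1 offsets buffer

for row in range(4):
    print(row, value_valid[row], value_lengths[row], typed_valid[row])
```

Read this way, rows 1 and 2 carry bytes in `value` (the one-byte Variant null and the four-byte encoding of "n/a"), while rows 0 and 3 are shredded into `typed_value`, which matches the measurements `34, null, "n/a", 100` listed in the diff.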
