(parquet-format) branch master updated: Clarify Variant specification (#457)

fokko Wed, 06 Nov 2024 09:55:29 -0800

This is an automated email from the ASF dual-hosted git repository.

fokko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new 1d81b7a  Clarify Variant specification (#457)
1d81b7a is described below

commit 1d81b7a0347ae424a90402c7122c194c087552e0
Author: Gene Pang <[email protected]>
AuthorDate: Wed Nov 6 09:54:34 2024 -0800

    Clarify Variant specification (#457)
    
    * [FOLLOWUP] Clarify Variant details
    
    * address feedback
    
    * minor fix
---
 VariantEncoding.md  | 17 ++++++++++++-----
 VariantShredding.md | 20 +++++++++++++++++---
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index 1eac3bc..c6d2d11 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -93,6 +93,7 @@ Next, is an `offset` list, which contains `dictionary_size + 1` values.
 Each `offset` is a little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`.
 The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
 The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
+All string values must be UTF-8 encoded strings.
 
 ## Metadata encoding grammar
 
@@ -107,7 +108,7 @@ offset_size_minus_one: 2-bit value providing the number of bytes per dictionary
 dictionary_size: `offset_size` bytes. little-endian value indicating the number of strings in the dictionary
 dictionary: <offset>* <bytes>
 offset: `offset_size` bytes. little-endian value indicating the starting position of the ith string in `bytes`. The list should contain `dictionary_size + 1` values, where the last value is the total length of `bytes`.
-bytes: dictionary string values
+bytes: UTF-8 encoded dictionary string values
 ```
 
 Notes:
@@ -209,7 +210,7 @@ The [primitive types table](#encoding-types) shows the encoding format for each
 
 ### Value Data for Short string (`basic_type`=1)
 
-When `basic_type` is `1`, `value_data` is the sequence of bytes that represents the string.
+When `basic_type` is `1`, `value_data` is the sequence of UTF-8 encoded bytes that represents the string.
 
 ### Value Data for Object (`basic_type`=2)
 
@@ -337,7 +338,7 @@ object_header: (is_large << 4 | field_id_size_minus_one << 2 | field_offset_size
 array_header: (is_large << 2 | field_offset_size_minus_one)
 value_data:  <primitive_val> | <short_string_val> | <object_val> | <array_val>
 primitive_val: see table for binary representation
-short_string_val: bytes
+short_string_val: UTF-8 encoded bytes
 object_val: <num_elements> <field_id>* <field_offset>* <fields>
 array_val: <num_elements> <field_offset>* <fields>
 num_elements: a 1 or 4 byte little-endian value (depending on is_large in <object_header>/<array_header>)
@@ -403,11 +404,17 @@ The *Logical Type* column indicates logical equivalence of physically encoded ty
 For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
 Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
 
-# Field ID order and uniqueness
+# String values must be UTF-8 encoded
+
+All strings within the Variant binary format must be UTF-8 encoded.
+This includes the dictionary key string values, the "short string" values, and the "long string" values.
+
+# Object field ID order and uniqueness
 
 For objects, field IDs and offsets must be listed in the order of the corresponding field names, sorted lexicographically.
-Note that the fields themselves are not required to follow this order.
+Note that the field values themselves are not required to follow this order.
 As a result, offsets will not necessarily be listed in ascending order.
+The field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values.
 
 An implementation may rely on this field ID order in searching for field names.
 E.g. a binary search on field IDs (combined with metadata lookups) may be used to find a field with a given field.
diff --git a/VariantShredding.md b/VariantShredding.md
index 51160a9..31e1f52 100644
--- a/VariantShredding.md
+++ b/VariantShredding.md
@@ -91,7 +91,7 @@ optional group variant_col {
 # Parquet Layout
 
 The `array` and `object` fields represent Variant array and object types, respectively.
-Arrays must use the three-level list structure described in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
+Arrays must use the three-level list structure described in [LogicalTypes.md](LogicalTypes.md).
 
 An `object` field must be a group.
 Each field name of this inner group corresponds to the Variant value's object field name.
@@ -143,6 +143,17 @@ There are two main motivations for including the `variant_value` column:
 1) In a case where there are rare type mismatches (for example, a numeric field with rare strings like “n/a”), we allow the field to be shredded, which could still be a significant performance benefit compared to fetching and decoding the full value/metadata binary.
 2) Since there is a single schema per file, there would be no easy way to recover from a type mismatch encountered late in a file write. Parquet files can be large, and buffering all file data before starting to write could be expensive. Including a variant column for every field guarantees we can adhere to the requested shredding schema.
 
+# Top-level metadata
+
+Any values stored in a shredded `variant_value` field may have dictionary IDs referring to the metadata.
+There is one metadata value for the entire Variant record, and that is stored in the top-level `metadata` field.
+This means any `variant_value` values in the shredded representation is only the "value" portion of the [Variant Binary Encoding](VariantEncoding.md).
+
+The metadata is kept at the top-level, instead of shredding the metadata with the shredded variant values because:
+* Simplified shredding scheme and specification. No need for additional struct-of-binary values, or custom concatenated binary scheme for `variant_value`.
+* Simplified and good performance for write shredding. No need to rebuild the metadata, or re-encode IDs for `variant_value`.
+* Simplified and good performance for Variant reconstruction. No need to re-encode IDs for `variant_value`.
+
 # Data Skipping
 
 Shredded columns are expected to store statistics in the same format as a normal Parquet column.
@@ -154,11 +165,14 @@ This specification is not strict about what values may be stored in `variant_val
 # Shredding Semantics
 
 Reconstruction of Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
-For example, the order of fields in the binary may change, as may the physical representation of scalar values.
+For example, in a reconstructed Variant value, the order of object field values may be different from the original binary.
+This is allowed since the [Variant Binary Encoding](VariantEncoding.md#object-field-id-order-and-uniqueness) does not require an ordering of the field values, but the field IDs will still be ordered lexicographically according to the corresponding field names.
 
+The physical representation of scalar values may also be different in the reconstructed Variant binary.
 In particular, the [Variant Binary Encoding](VariantEncoding.md) considers all integer and decimal representations to represent a single logical type.
+This flexibility enables shredding to be applicable in more scenarios, while maintaining all information and values losslessly.
 As a result, it is valid to shred a decimal into a decimal column with a different scale, or to shred an integer as a decimal, as long as no numeric precision is lost.
-For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the **variant_value** column.
+For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the `variant_value` column.
 When reconstructing, it would be valid for a reader to reconstruct 123 as an integer, or as a Decimal(9, 2).
 Engines should not depend on the physical type of a Variant value, only the logical type.
