This is an automated email from the ASF dual-hosted git repository.
fokko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 1d81b7a Clarify Variant specification (#457)
1d81b7a is described below
commit 1d81b7a0347ae424a90402c7122c194c087552e0
Author: Gene Pang
AuthorDate: Wed Nov 6 09:54:34 2024 -0800
Clarify Variant specification (#457)
* [FOLLOWUP] Clarify Variant details
* address feedback
* minor fix
---
VariantEncoding.md | 17 ++++++++++++-----
VariantShredding.md | 20 +++++++++++++++++---
2 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/VariantEncoding.md b/VariantEncoding.md
index 1eac3bc..c6d2d11 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -93,6 +93,7 @@ Next, is an `offset` list, which contains `dictionary_size +
1` values.
Each `offset` is a little-endian value of `offset_size` bytes, and represents
the starting byte offset of the i-th string in `bytes`.
The first `offset` value will always be `0`, and the last `offset` value will
always be the total length of `bytes`.
The last part of the metadata is `bytes`, which stores all the string values
in the dictionary.
+All string values must be UTF-8 encoded strings.
## Metadata encoding grammar
@@ -107,7 +108,7 @@ offset_size_minus_one: 2-bit value providing the number of
bytes per dictionary
dictionary_size: `offset_size` bytes. little-endian value indicating the
number of strings in the dictionary
dictionary: <offset>* <bytes>
offset: `offset_size` bytes. little-endian value indicating the starting
position of the ith string in `bytes`. The list should contain `dictionary_size
+ 1` values, where the last value is the total length of `bytes`.
-bytes: dictionary string values
+bytes: UTF-8 encoded dictionary string values
```
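
As an illustration of the `dictionary_size`, `offset`, and `bytes` rules in the grammar above, the following is a minimal Python sketch, not a reference implementation. It assumes the caller has already parsed the header byte and knows `offset_size`, and that the buffer starts at `dictionary_size`. The function name `decode_dictionary` is hypothetical.

```python
def decode_dictionary(buf: bytes, offset_size: int) -> list[str]:
    """Decode <dictionary_size> <offset>* <bytes>, i.e. the metadata after its header byte.

    Assumption: `offset_size` (1 to 4) was already derived from the header.
    """
    def read_uint(pos: int) -> int:
        # Every integer here is little-endian and `offset_size` bytes wide.
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dictionary_size = read_uint(0)
    offsets_start = offset_size
    # The offset list holds dictionary_size + 1 entries.
    offsets = [
        read_uint(offsets_start + i * offset_size)
        for i in range(dictionary_size + 1)
    ]
    data = buf[offsets_start + (dictionary_size + 1) * offset_size:]

    # First offset is always 0; the last one is the total length of `bytes`.
    if offsets[0] != 0 or offsets[-1] > len(data):
        raise ValueError("malformed offset list")

    # Strict decoding rejects any dictionary string that is not valid UTF-8.
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(dictionary_size)
    ]


# Two strings, "a" and "bc", with 1-byte offsets: size=2, offsets=[0, 1, 3].
example = bytes([2, 0, 1, 3]) + b"abc"
print(decode_dictionary(example, offset_size=1))
```

The strict `decode("utf-8")` call raises `UnicodeDecodeError` on invalid input, which is one way a reader could enforce the UTF-8 requirement.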
Notes:
@@ -209,7 +210,7 @@ The [primitive types table](#encoding-types) shows the
encoding format for each
### Value Data for Short string (`basic_type`=1)
-When `basic_type` is `1`, `value_data` is the sequence of bytes that
represents the string.
+When `basic_type` is `1`, `value_data` is the sequence of UTF-8 encoded bytes
that represents the string.
### Value Data for Object (`basic_type`=2)
@@ -337,7 +338,7 @@ object_header: (is_large << 4 | field_id_size_minus_one <<
2 | field_offset_size
array_header: (is_large << 2 | field_offset_size_minus_one)
value_data: <primitive_val> | <short_string_val> | <object_val> | <array_val>
primitive_val: see table for binary representation
-short_string_val: bytes
+short_string_val: UTF-8 encoded bytes
object_val: <num_elements> <field_id>* <field_offset>* <fields>
array_val: <num_elements> <field_offset>* <fields>
num_elements: a 1 or 4 byte little-endian value (depending on is_large in
<object_header>/<array_header>)
@@ -403,11 +404,17 @@ The *Logical Type* column indicates logical equivalence
of physically encoded ty
For example, a user expression operating on a string value containing "hello"
should behave the same, whether it is encoded with the short string
optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave
the same as a decimal16 with scale 2 and unscaled value 100.
-# Field ID order and uniqueness
+# String values must be UTF-8 encoded
+
+All strings within the Variant binary format must be UTF-8 encoded.
+This includes the dictionary key string values, the "short string" values, and
the "long string" values.
+
+# Object field ID order and uniqueness
For objects, field IDs and offsets must be listed in the order of the
corresponding field names, sorted lexicographically.
-Note that the fields themselves are not required to follow this order.
+Note that the field values themselves are not required to follow this order.
As a result, offsets will not necessarily be listed in ascending order.
+The field values are not required to be in the same order as the field IDs, to
enable flexibility when constructing Variant values.
An implementation may rely on this field ID order in searching for field names.
E.g. a binary search on field IDs (combined with metadata lookups) may be used
to find a field with a given field name.
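
The lookup described above could be sketched roughly as follows. This is only an illustration: `find_field`, `field_ids`, and `dictionary` are hypothetical names, the object's field IDs and the metadata dictionary are assumed to be already decoded into Python lists, and Python's code-point string comparison is assumed to match the ordering the writer used (it agrees with byte-wise comparison of UTF-8).

```python
from typing import Optional


def find_field(field_ids: list[int], dictionary: list[str], name: str) -> Optional[int]:
    """Return the position of `name` within an object's field ID list, or None.

    `field_ids` is ordered by the field names the IDs refer to, so each probe
    needs a metadata lookup (`dictionary[field_id]`) to obtain the name.
    """
    lo, hi = 0, len(field_ids)
    while lo < hi:
        mid = (lo + hi) // 2
        if dictionary[field_ids[mid]] < name:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(field_ids) and dictionary[field_ids[lo]] == name:
        return lo
    return None


# The dictionary itself is not sorted here; only the object's field ID list is.
dictionary = ["b", "a", "c"]
field_ids = [1, 2]        # refers to "a", then "c"
field_offsets = [5, 0]    # the value of "c" is stored before the value of "a"

pos = find_field(field_ids, dictionary, "c")
print(pos, field_offsets[pos])
print(find_field(field_ids, dictionary, "b"))
```

The returned position indexes both the field ID list and the field offset list, which is why the offsets need not be ascending for the search to work.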
diff --git a/VariantShredding.md b/VariantShredding.md
index 51160a9..31e1f52 100644
--- a/VariantShredding.md
+++ b/VariantShredding.md
@@ -91,7 +91,7 @@ optional group variant_col {
# Parquet Layout
The `array` and `object` fields represent Variant array and object types,
respectively.
-Arrays must use the three-level list structure described in
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
+Arrays must use the three-level list structure described in
[LogicalTypes.md](LogicalTypes.md).
An `object` field must be a group.
Each field name of this inner group corresponds to the Variant value's object
field name.
@@ -143,6 +143,17 @@ There are two main motivations for including the
`variant_value` column:
1) In a case where there are rare type mismatches (for example, a numeric
field with rare strings like “n/a”), we allow the field to be shredded, which
could still be a significant performance benefit compared to fetching and
decoding the full value/metadata binary.
2) Since there is a single schema per file, there would be no easy way to
recover from a type mismatch encountered late in a file write. Parquet files
can be large, and buffering all file data before starting to write could be
expensive. Including a variant column for every field guarantees we can adhere
to the requested shredding schema.
+# Top-level metadata
+
+Any values stored in a shredded `variant_value` field may have dictionary IDs
referring to the metadata.
+There is one metadata value for the entire Variant record, and that is stored
in the top-level `metadata` field.
+This means any `variant_value` values in the shredded representation are only
the "value" portion of the [Variant Binary Encoding](VariantEncoding.md).
+
+The metadata is kept at the top level, instead of being shredded along with
the shredded variant values, because:
+* Simplified shredding scheme and specification. No need for additional
struct-of-binary values, or custom concatenated binary scheme for
`variant_value`.
+* Simplified and good performance for write shredding. No need to rebuild the
metadata, or re-encode IDs for `variant_value`.
+* Simplified and good performance for Variant reconstruction. No need to
re-encode IDs for `variant_value`.
+
# Data Skipping
Shredded columns are expected to store statistics in the same format as a
normal Parquet column.
@@ -154,11 +165,14 @@ This specification is not strict about what values may be
stored in `variant_val
# Shredding Semantics
Reconstruction of Variant value from a shredded representation is not expected
to produce a bit-for-bit identical binary to the original unshredded value.
-For example, the order of fields in the binary may change, as may the physical
representation of scalar values.
+For example, in a reconstructed Variant value, the order of object field
values may be different from the original binary.
+This is allowed since the [Variant Binary
Encoding](VariantEncoding.md#object-field-id-order-and-uniqueness) does not
require an ordering of the field values, but the field IDs will still be
ordered lexicographically according to the corresponding field names.
+The physical representation of scalar values may also be different in the
reconstructed Variant binary.
In particular, the [Variant Binary Encoding](VariantEncoding.md) considers all
integer and decimal representations to represent a single logical type.
+This flexibility enables shredding to be applicable in more scenarios, while
maintaining all information and values losslessly.
As a result, it is valid to shred a decimal into a decimal column with a
different scale, or to shred an integer as a decimal, as long as no numeric
precision is lost.
-For example, it would be valid to write the value 123 to a Decimal(9, 2)
column, but the value 1.234 would need to be written to the **variant_value**
column.
+For example, it would be valid to write the value 123 to a Decimal(9, 2)
column, but the value 1.234 would need to be written to the `variant_value`
column.
When reconstructing, it would be valid for a reader to reconstruct 123 as an
integer, or as a Decimal(9, 2).
Engines should not depend on the physical type of a Variant value, only the
logical type.
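
The "no numeric precision is lost" rule above can be checked mechanically. The sketch below is a hedged illustration using Python's standard `decimal` module; `fits_decimal` is a hypothetical helper, and a real writer would operate on the decoded Variant value rather than a Python `Decimal`.

```python
from decimal import Decimal


def fits_decimal(value: Decimal, precision: int, scale: int) -> bool:
    """True if `value` can be stored in Decimal(precision, scale) without loss."""
    scaled = value * (10 ** scale)
    # Digits beyond `scale` would have to be rounded away: not lossless.
    if scaled != scaled.to_integral_value():
        return False
    # The unscaled integer must fit in `precision` decimal digits.
    return abs(int(scaled)) < 10 ** precision


print(fits_decimal(Decimal("123"), 9, 2))    # can go to the typed column
print(fits_decimal(Decimal("1.234"), 9, 2))  # must go to `variant_value`

# Logical equivalence: unscaled value 100 with scale 2 is the same number as 1.
print(Decimal(100).scaleb(-2) == Decimal(1))
```

Under these assumptions, 123 is accepted for a Decimal(9, 2) column while 1.234 is rejected, matching the example in the text.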