Re: [PR] GH-455: Add Variant specification docs [parquet-format]

via GitHub Fri, 11 Oct 2024 11:51:54 -0700


julienledem commented on code in PR #456:
URL: https://github.com/apache/parquet-format/pull/456#discussion_r1797319275



##########
VariantEncoding.md:
##########
@@ -0,0 +1,429 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Variant Binary Encoding
+
+> [!IMPORTANT]
+> **This specification is still under active development, and has not been 
formally adopted.**
+
+A Variant represents a type that contain one of:
+- Primitive: A type and corresponding value (e.g. INT, STRING)
+- Array: An ordered list of Variant values
+- Object: An unordered collection of string/Variant pairs (i.e. key/value 
pairs). An object may not contain duplicate keys.
+
+A Variant is encoded with 2 binary values, the [value](#value-encoding) and 
the [metadata](#metadata-encoding).
+
+There are a fixed number of allowed primitive types, provided in the table 
below.
+These represent a commonly supported subset of the [logical 
types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 
allowed by the Parquet.
+
+The Variant Binary Encoding allows representation of semi-structured data 
(e.g. JSON) in a form that can be efficiently queried by path.
+The design is intended to allow efficient access to nested data even in the 
presence of very wide or deep structures.
+
+Another motivation for the representation is that (aside from metadata) each 
nested Variant value is contiguous and self-contained.
+For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
+
+This document describes the Variant Binary Encoding scheme.
+[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+
+# Variant in Parquet
+A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
+Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
+
+# Metadata encoding
+
+The encoded metadata always starts with a header byte.
+```
+             7     6  5   4  3             0
+            +-------+---+---+---------------+
+header      |       |   |   |    version    |
+            +-------+---+---+---------------+
+                ^         ^
+                |         +-- sorted_strings
+                +-- offset_size_minus_one
+```
+The `version` is a 4-bit value that must always contain the value `1`.
+`sorted_strings` is a 1-bit value indicating whether dictionary strings are 
sorted and unique.
+`offset_size_minus_one` is a 2-bit value providing the number of bytes per 
dictionary size and offset field.
+The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
+
+The entire metadata is encoded as the following diagram shows:
+```
+           7                     0
+          +-----------------------+
+metadata  |        header         |
+          +-----------------------+
+          |                       |
+          :    dictionary_size    :  <-- little-endian, `offset_size` bytes
+          |                       |
+          +-----------------------+
+          |                       |
+          :        offset         :  <-- little-endian, `offset_size` bytes
+          |                       |
+          +-----------------------+
+                      :
+          +-----------------------+
+          |                       |
+          :        offset         :  <-- little-endian, `offset_size` bytes
+          |                       |      (`dictionary_size + 1` offsets)
+          +-----------------------+
+          |                       |
+          :         bytes         :
+          |                       |
+          +-----------------------+
+```
+
+The metadata is encoded first with the `header` byte, then `dictionary_size` 
which is a little-endian value of `offset_size` bytes, and represents the 
number of string values in the dictionary.
+Next, is an `offset` list, which contains `dictionary_size + 1` values.
+Each `offset` is a little-endian value of `offset_size` bytes, and represents 
the starting byte offset of the i-th string in `bytes`.
+The first `offset` value will always be `0`, and the last `offset` value will 
always be the total length of `bytes`.
+The last part of the metadata is `bytes`, which stores all the string values 
in the dictionary.
+
+## Metadata encoding grammar
+
+The grammar for encoded metadata is as follows
+
+```
+metadata: <header> <dictionary_size> <dictionary>
+header: 1 byte (<version> | <sorted_strings> << 4 | (<offset_size_minus_one> 
<< 6))
+version: a 4-bit version ID. Currently, must always contain the value 1
+sorted_strings: a 1-bit value indicating whether metadata strings are sorted
+offset_size_minus_one: 2-bit value providing the number of bytes per 
dictionary size and offset field.
+dictionary_size: `offset_size` bytes. little-endian value indicating the 
number of strings in the dictionary
+dictionary: <offset>* <bytes>
+offset: `offset_size` bytes. little-endian value indicating the starting 
position of the ith string in `bytes`. The list should contain `dictionary_size 
+ 1` values, where the last value is the total length of `bytes`.
+bytes: dictionary string values
+```
+
+Notes:
+- Offsets are relative to the start of the `bytes` array.
+- The length of the ith string can be computed as `offset[i+1] - offset[i]`.
+- The offset of the first string is always equal to 0 and is therefore 
redundant. It is included in the spec to simplify in-memory-processing.
+- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` 
and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 
2-byte offsets, 2 indicates 3 byte offsets and 3 indicates 4-byte offsets.
+- If `sorted_strings` is set to 1, strings in the dictionary must be unique 
and sorted in lexicographic order. If the value is set to 0, readers may not 
make any assumptions about string order or uniqueness.

Review Comment:
   so lexicographic order is defined by unicode code points.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] GH-455: Add Variant specification docs [parquet-format]

Reply via email to