This is an automated email from the ASF dual-hosted git repository.
emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new a3dda6a GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)
a3dda6a is described below
commit a3dda6ac6691b33525b75f230a320f98d4027f86
Author: Aihua Xu <[email protected]>
AuthorDate: Tue Dec 10 14:48:32 2024 -0800
GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)
* Add more types - time, nano timestamps, UUID to Variant.
* Update type names to align with Parquet logical type
* Update logical type
* Update VariantEncoding.md
Co-authored-by: emkornfield <[email protected]>
* Update VariantEncoding.md
Co-authored-by: emkornfield <[email protected]>
---------
Co-authored-by: emkornfield <[email protected]>
---
VariantEncoding.md | 55 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 32 insertions(+), 23 deletions(-)
diff --git a/VariantEncoding.md b/VariantEncoding.md
index 53a2a68..2930c71 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
The Decimal type contains a scale, but no precision. The implied precision of
a decimal value is `floor(log_10(val)) + 1`.
# Encoding types
+*Variant basic types*
| Basic Type | ID | Description |
|--------------|-----|---------------------------------------------------|
@@ -373,25 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |
-| Logical Type         | Physical Type               | Type ID | Equivalent Parquet Type     | Binary format |
-|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
-| NullType             | null                        | `0`     | any                         | none |
-| Boolean              | boolean (True)              | `1`     | BOOLEAN                     | none |
-| Boolean              | boolean (False)             | `2`     | BOOLEAN                     | none |
-| Exact Numeric        | int8                        | `3`     | INT(8, signed)              | 1 byte |
-| Exact Numeric        | int16                       | `4`     | INT(16, signed)             | 2 byte little-endian |
-| Exact Numeric        | int32                       | `5`     | INT(32, signed)             | 4 byte little-endian |
-| Exact Numeric        | int64                       | `6`     | INT(64, signed)             | 8 byte little-endian |
-| Double               | double                      | `7`     | DOUBLE                      | IEEE little-endian |
-| Exact Numeric        | decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric        | decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric        | decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Date                 | date                        | `11`    | DATE                        | 4 byte little-endian |
-| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian |
-| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian |
-| Float                | float                       | `14`    | FLOAT                       | IEEE little-endian |
-| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes |
-| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+*Variant primitive types*
+
+| Type Equivalence Class | Physical Type               | Type ID | Equivalent Parquet Type                  | Binary format |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
+| NullType               | null                        | `0`     | any                                      | none |
+| Boolean                | boolean (True)              | `1`     | BOOLEAN                                  | none |
+| Boolean                | boolean (False)             | `2`     | BOOLEAN                                  | none |
+| Exact Numeric          | int8                        | `3`     | INT(8, signed)                           | 1 byte |
+| Exact Numeric          | int16                       | `4`     | INT(16, signed)                          | 2 byte little-endian |
+| Exact Numeric          | int32                       | `5`     | INT(32, signed)                          | 4 byte little-endian |
+| Exact Numeric          | int64                       | `6`     | INT(64, signed)                          | 8 byte little-endian |
+| Double                 | double                      | `7`     | DOUBLE                                   | IEEE little-endian |
+| Exact Numeric          | decimal4                    | `8`     | DECIMAL(precision, scale)                | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric          | decimal8                    | `9`     | DECIMAL(precision, scale)                | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric          | decimal16                   | `10`    | DECIMAL(precision, scale)                | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Date                   | date                        | `11`    | DATE                                     | 4 byte little-endian |
+| Timestamp              | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)  | 8-byte little-endian |
+| TimestampNTZ           | timestamp without time zone | `13`    | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Float                  | float                       | `14`    | FLOAT                                    | IEEE little-endian |
+| Binary                 | binary                      | `15`    | BINARY                                   | 4 byte little-endian size, followed by bytes |
+| String                 | string                      | `16`    | STRING                                   | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| TimeNTZ                | time without time zone      | `21`    | TIME(isAdjustedToUTC=false, MICROS)      | 8-byte little-endian |
+| Timestamp              | timestamp with time zone    | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)   | 8-byte little-endian |
+| TimestampNTZ           | timestamp without time zone | `23`    | TIMESTAMP(isAdjustedToUTC=false, NANOS)  | 8-byte little-endian |
+| UUID                   | uuid                        | `24`    | UUID                                     | 16-byte big-endian |
+
+The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
+For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
+Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
+*Decimal table*
| Decimal Precision | Decimal value type |
|-----------------------|--------------------|
@@ -400,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| 18 <= precision <= 38 | int128 |
| > 38 | Not supported |
-The *Logical Type* column indicates logical equivalence of physically encoded types.
-For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
-Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
-
# String values must be UTF-8 encoded
All strings within the Variant binary format must be UTF-8 encoded.
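
To illustrate how the primitive type table above might be read, here is a minimal Python sketch of decoding a single primitive payload once its Type ID is known. It is not part of the commit or of any reference implementation: the function name `decode_primitive` and the lookup tables are hypothetical, extraction of the Type ID from the value header is out of scope, and the sketch assumes that decimal4, decimal8 and decimal16 carry signed 4-, 8- and 16-byte unscaled values (inferred from the type names and the decimal table).

```python
import struct
import uuid
from decimal import Decimal

# Fixed-width payloads keyed by primitive Type ID, all little-endian per the table.
# Dates, times and timestamps are returned as raw integers (no calendar conversion).
_FIXED_FORMATS = {
    3: "<b",   # int8
    4: "<h",   # int16
    5: "<i",   # int32
    6: "<q",   # int64
    7: "<d",   # double
    11: "<i",  # date
    12: "<q",  # timestamp with time zone, MICROS
    13: "<q",  # timestamp without time zone, MICROS
    14: "<f",  # float
    21: "<q",  # time without time zone, MICROS
    22: "<q",  # timestamp with time zone, NANOS
    23: "<q",  # timestamp without time zone, NANOS
}

# Assumption: width in bytes of the unscaled value for decimal4/8/16.
_DECIMAL_WIDTHS = {8: 4, 9: 8, 10: 16}


def decode_primitive(type_id: int, payload: bytes):
    """Decode one primitive value from its payload bytes (hypothetical helper)."""
    if type_id == 0:
        return None
    if type_id in (1, 2):
        return type_id == 1  # True is ID 1, False is ID 2; neither has a payload
    if type_id in _FIXED_FORMATS:
        return struct.unpack_from(_FIXED_FORMATS[type_id], payload)[0]
    if type_id in _DECIMAL_WIDTHS:
        scale = payload[0]
        if scale > 38:
            raise ValueError(f"decimal scale {scale} outside [0, 38]")
        width = _DECIMAL_WIDTHS[type_id]
        # Assumption: the unscaled value is a signed two's-complement integer.
        unscaled = int.from_bytes(payload[1:1 + width], "little", signed=True)
        # Building from a string keeps every digit regardless of context precision.
        return Decimal(f"{unscaled}E-{scale}")
    if type_id in (15, 16):
        size = int.from_bytes(payload[:4], "little")
        data = payload[4:4 + size]
        return data if type_id == 15 else data.decode("utf-8")
    if type_id == 24:
        return uuid.UUID(bytes=payload[:16])  # 16 bytes, big-endian
    raise ValueError(f"unsupported primitive type id {type_id}")
```

Under these assumptions, the equivalence example from the diff holds: an int8 payload `b"\x01"` and a decimal16 payload with scale 2 and unscaled value 100 both decode to values that compare equal to 1.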