This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c5c880e690c3 [SPARK-49591][SQL] Add Logical Type column to variant readme
c5c880e690c3 is described below
commit c5c880e690c38b2bb597b7a38f20b32e2e2d272c
Author: cashmand <[email protected]>
AuthorDate: Thu Sep 12 22:35:57 2024 +0800
[SPARK-49591][SQL] Add Logical Type column to variant readme
### What changes were proposed in this pull request?
Add a concept of logical type to the variant README.md, distinct from the
physical encoding of a value. In particular, decimal and integer values are
considered to be members of a single "Exact Numeric" type.
### Why are the changes needed?
This is intended to describe and justify the existing Spark behaviour for
Variant (e.g. stripping trailing zeros for decimal to string casts), not change
it. (Although the SchemaOfVariant expression does not strictly follow this
right now for numeric types, and should be updated to match it.) The motivation
for introducing a single numeric type that encompasses integer and decimal
values is to allow more flexibility in storage (particularly once shredding is
introduced), and provide a [...]
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It is a documentation change.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #48064 from cashmand/cashmand/SPARK-49591.
Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
---
common/variant/README.md | 44 +++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/common/variant/README.md b/common/variant/README.md
index a66d708da75b..4ed7c16f5b6e 100644
--- a/common/variant/README.md
+++ b/common/variant/README.md
@@ -333,27 +333,27 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |
-| Primitive Type              | Type ID | Equivalent Parquet Type     | Binary format |
-|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
-| null                        | `0`     | any                         | none |
-| boolean (True)              | `1`     | BOOLEAN                     | none |
-| boolean (False)             | `2`     | BOOLEAN                     | none |
-| int8                        | `3`     | INT(8, signed)              | 1 byte |
-| int16                       | `4`     | INT(16, signed)             | 2 byte little-endian |
-| int32                       | `5`     | INT(32, signed)             | 4 byte little-endian |
-| int64                       | `6`     | INT(64, signed)             | 8 byte little-endian |
-| double                      | `7`     | DOUBLE                      | IEEE little-endian |
-| decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| date                        | `11`    | DATE                        | 4 byte little-endian |
-| timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian |
-| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian |
-| float                       | `14`    | FLOAT                       | IEEE little-endian |
-| binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes |
-| string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| year-month interval         | `19`    | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
-| day-time interval           | `20`    | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
+| Logical Type         | Physical Type               | Type ID | Equivalent Parquet Type     | Binary format |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
+| NullType             | null                        | `0`     | any                         | none |
+| Boolean              | boolean (True)              | `1`     | BOOLEAN                     | none |
+| Boolean              | boolean (False)             | `2`     | BOOLEAN                     | none |
+| Exact Numeric        | int8                        | `3`     | INT(8, signed)              | 1 byte |
+| Exact Numeric        | int16                       | `4`     | INT(16, signed)             | 2 byte little-endian |
+| Exact Numeric        | int32                       | `5`     | INT(32, signed)             | 4 byte little-endian |
+| Exact Numeric        | int64                       | `6`     | INT(64, signed)             | 8 byte little-endian |
+| Double               | double                      | `7`     | DOUBLE                      | IEEE little-endian |
+| Exact Numeric        | decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric        | decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric        | decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Date                 | date                        | `11`    | DATE                        | 4 byte little-endian |
+| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian |
+| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian |
+| Float                | float                       | `14`    | FLOAT                       | IEEE little-endian |
+| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes |
+| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| YMInterval           | year-month interval         | `19`    | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
+| DTInterval           | day-time interval           | `20`    | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
| Decimal Precision | Decimal value type |
|-----------------------|--------------------|
@@ -362,6 +362,8 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| 18 <= precision <= 38 | int128 |
| > 38 | Not supported |
+The *Logical Type* column indicates logical equivalence of physically encoded types. For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding. Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
The year-month and day-time interval types have one byte at the beginning
indicating the start and end fields. In the case of the year-month interval,
the least significant bit denotes the start field and the next least
significant bit denotes the end field. The remaining 6 bits are unused. A field
value of 0 represents YEAR and 1 represents MONTH. In the case of the day-time
interval, the least significant 2 bits denote the start field and the next
least significant 2 bits denote the en [...]
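
(Reader's note, not part of the commit.) The year-month layout described above can be sketched in Python. This is only an illustration under stated assumptions: `decode_year_month_interval` and `YEAR_MONTH_FIELDS` are hypothetical names, the input is assumed to be the 5-byte body with any variant value header already stripped, and the day-time interval is left out because its field codes are truncated in this email.

```python
import struct
from typing import Tuple

# Field codes for the year-month interval, as given in the README text:
# 0 represents YEAR and 1 represents MONTH.
YEAR_MONTH_FIELDS = {0: "YEAR", 1: "MONTH"}


def decode_year_month_interval(payload: bytes) -> Tuple[str, str, int]:
    """Decode a year-month interval body (hypothetical helper).

    Assumes `payload` is exactly 1 header byte followed by a 4-byte
    little-endian signed integer, per the table row for type ID 19.
    """
    if len(payload) != 5:
        raise ValueError("expected 1 header byte plus a 4-byte value")
    header = payload[0]
    start_field = header & 0x1        # least significant bit
    end_field = (header >> 1) & 0x1   # next least significant bit
    # The remaining 6 header bits are unused and ignored here.
    (value,) = struct.unpack("<i", payload[1:5])
    return YEAR_MONTH_FIELDS[start_field], YEAR_MONTH_FIELDS[end_field], value


# Header 0b10: start field 0 (YEAR), end field 1 (MONTH); value 14.
example = bytes([0b10]) + struct.pack("<i", 14)
print(decode_year_month_interval(example))  # ('YEAR', 'MONTH', 14)
```
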
Type IDs 17 and 18 were originally reserved for a prototype feature
(string-from-metadata) that was never implemented. These IDs are available for
use by new types.
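
(Reader's note, not part of the commit.) The logical equivalence added by this change can be illustrated with a short Python sketch. It is not Spark's implementation: `exact_numeric` and `to_string` are hypothetical helpers that model every integer and decimal physical encoding as an (unscaled value, scale) pair, and the string form drops trailing zeros in the spirit of the decimal-to-string behaviour mentioned in the commit message.

```python
from decimal import Decimal


def exact_numeric(unscaled: int, scale: int = 0) -> Decimal:
    """Build the logical value unscaled * 10**-scale (hypothetical helper).

    Integers are treated as decimals with scale 0. The tuple constructor
    is used so that no context rounding is applied.
    """
    if not 0 <= scale <= 38:
        raise ValueError("scale must be in the range [0, 38]")
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


def to_string(value: Decimal) -> str:
    """Render a value without trailing fractional zeros."""
    text = format(value, "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


int8_one = exact_numeric(1)               # int8 value 1
decimal16_one = exact_numeric(100, 2)     # decimal16, scale 2, unscaled 100

print(int8_one == decimal16_one)                       # True
print(to_string(int8_one), to_string(decimal16_one))   # 1 1
```

Comparing on the logical value rather than on the physical type ID is what lets a writer choose the most compact encoding without changing what a reader observes.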