This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c728c223f8eb [SPARK-49449][SQL][DOCS] Remove string-from-metadata and
binary-from-metadata
c728c223f8eb is described below
commit c728c223f8eb424b18c1d43f28b28f9b3546a953
Author: cashmand <[email protected]>
AuthorDate: Thu Aug 29 17:16:26 2024 +0900
[SPARK-49449][SQL][DOCS] Remove string-from-metadata and
binary-from-metadata
### What changes were proposed in this pull request?
The string-from-metadata and binary-from-metadata types were included in
the initial spec, but never implemented for Spark 4.0 due to complexity and
lack of a compelling use case. This PR removes them from the spec to align with
the implementation. Nothing prevents us from adding these in the future, but
Spark 4.0 would presumably not be able to read such a value, so having it in
the spec at this point is confusing.
### Why are the changes needed?
Clarifies Spark behavior.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It is a README-only change.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47917 from cashmand/SPARK-49449.
Authored-by: cashmand <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
common/variant/README.md | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
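
Reader-side note: after this change, primitive type IDs 17 and 18 have no assigned meaning in the spec. A minimal sketch of how a reader might treat them follows. It assumes a hypothetical lookup table (`PRIMITIVE_TYPE_NAMES`) that holds only the IDs visible in this diff, and it is not Spark's actual implementation.

```python
# Hypothetical lookup table: only the type IDs that appear in this diff.
PRIMITIVE_TYPE_NAMES = {
    14: "float",
    15: "binary",
    16: "string",
    19: "year-month interval",
    20: "day-time interval",
}

# IDs whose rows this commit removes from the "Primitive Types" table.
UNASSIGNED_TYPE_IDS = frozenset({17, 18})


def primitive_type_name(type_id: int) -> str:
    """Return the spec name for a primitive type ID, rejecting unassigned IDs."""
    if type_id in UNASSIGNED_TYPE_IDS:
        raise ValueError(f"primitive type ID {type_id} is unassigned in the spec")
    try:
        return PRIMITIVE_TYPE_NAMES[type_id]
    except KeyError:
        # The full table has more rows; this sketch covers only the diff context.
        raise ValueError(f"type ID {type_id} is not covered by this sketch") from None
```

Raising on 17 and 18 rather than guessing matches the commit message's point that Spark 4.0 would presumably not be able to read such a value.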
diff --git a/common/variant/README.md b/common/variant/README.md
index 391815dabf99..a66d708da75b 100644
--- a/common/variant/README.md
+++ b/common/variant/README.md
@@ -322,8 +322,6 @@ Each `array_val` and `object_val` must contain exactly `num_elements + 1` values
The "short string" basic type may be used as an optimization to fold string
length into the type byte for strings less than 64 bytes. It is semantically
identical to the "string" primitive type.
-String and binary values may also be represented as an index into the metadata dictionary. (See “string from metadata” and “binary from metadata” in the “Primitive Types” table) Writers may choose to use this mechanism to avoid repeating identical string values in a Variant object.
-
The Decimal type contains a scale, but no precision. The implied precision of
a decimal value is `floor(log_10(val)) + 1`.
# Encoding types
@@ -354,8 +352,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| float | `14` | FLOAT | IEEE little-endian |
| binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
-| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| year-month interval | `19` | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
| day-time interval | `20` | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
@@ -368,6 +364,8 @@ The Decimal type contains a scale, but no precision. The implied precision of a
The year-month and day-time interval types have one byte at the beginning
indicating the start and end fields. In the case of the year-month interval,
the least significant bit denotes the start field and the next least
significant bit denotes the end field. The remaining 6 bits are unused. A field
value of 0 represents YEAR and 1 represents MONTH. In the case of the day-time
interval, the least significant 2 bits denote the start field and the next
least significant 2 bits denote the en [...]
+Type IDs 17 and 18 were originally reserved for a prototype feature (string-from-metadata) that was never implemented. These IDs are available for use by new types.
+
[1] The parquet format does not have pure equivalents for the year-month and
day-time interval types. Year-month intervals are usually represented using
int32 values and the day-time intervals are usually represented using int64
values. However, these values don't include the start and end fields of these
types. Therefore, Spark stores them in the column metadata.
# Field ID order and uniqueness
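
The year-month interval layout quoted in the diff context above (one header byte, then a 4-byte little-endian value) can be sketched as follows. This is an illustrative decoder under stated assumptions, not code from Spark: the function name and the returned tuple shape are hypothetical, and the payload is assumed to begin right after the Variant type byte. The day-time interval is left out because its field description is truncated in this email.

```python
import struct

# Field codes for the year-month interval, as stated in the README text.
YEAR_MONTH_FIELDS = {0: "YEAR", 1: "MONTH"}


def decode_year_month_interval(payload: bytes):
    """Decode (start_field, end_field, value) from a 5-byte payload.

    Hypothetical helper: payload[0] is the header byte and payload[1:5]
    is the signed 32-bit little-endian value.
    """
    if len(payload) != 5:
        raise ValueError("expected 1 header byte followed by a 4-byte value")
    header = payload[0]
    start = header & 0x1         # least significant bit: start field
    end = (header >> 1) & 0x1    # next bit: end field
    # The remaining 6 header bits are unused and ignored here.
    (value,) = struct.unpack("<i", payload[1:5])
    return YEAR_MONTH_FIELDS[start], YEAR_MONTH_FIELDS[end], value


# Header 0b10 means start = YEAR (0) and end = MONTH (1).
example = bytes([0b10]) + struct.pack("<i", 14)
print(decode_year_month_interval(example))  # ('YEAR', 'MONTH', 14)
```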