(spark) branch master updated: [SPARK-49449][SQL][DOCS] Remove string-from-metadata and binary-from-metadata

gurwls223 Thu, 29 Aug 2024 01:17:06 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new c728c223f8eb [SPARK-49449][SQL][DOCS] Remove string-from-metadata and binary-from-metadata
c728c223f8eb is described below

commit c728c223f8eb424b18c1d43f28b28f9b3546a953
Author: cashmand <[email protected]>
AuthorDate: Thu Aug 29 17:16:26 2024 +0900

    [SPARK-49449][SQL][DOCS] Remove string-from-metadata and binary-from-metadata
    
    ### What changes were proposed in this pull request?
    
    The string-from-metadata and binary-from-metadata types were included in the initial spec, but never implemented for Spark 4.0 due to complexity and lack of a compelling use case. This PR removes them from the spec to align with the implementation. Nothing prevents us from adding these in the future, but Spark 4.0 would presumably not be able to read such a value, so having it in the spec at this point is confusing.
    
    ### Why are the changes needed?
    
    Clarifies Spark behavior.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It is a README-only change.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47917 from cashmand/SPARK-49449.
    
    Authored-by: cashmand <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 common/variant/README.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/common/variant/README.md b/common/variant/README.md
index 391815dabf99..a66d708da75b 100644
--- a/common/variant/README.md
+++ b/common/variant/README.md
@@ -322,8 +322,6 @@ Each `array_val` and `object_val` must contain exactly `num_elements + 1` values
 
 The "short string" basic type may be used as an optimization to fold string length into the type byte for strings less than 64 bytes. It is semantically identical to the "string" primitive type.
 
-String and binary values may also be represented as an index into the metadata dictionary. (See “string from metadata” and “binary from metadata” in the “Primitive Types” table) Writers may choose to use this mechanism to avoid repeating identical string values in a Variant object.
-
 The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
 
 # Encoding types
@@ -354,8 +352,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | float                       | `14`    | FLOAT                       | IEEE little-endian                                                                                                  |
 | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
 | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
-| binary from metadata        | `17`    | BINARY                      | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`.           |
-| string from metadata        | `18`    | STRING                      | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`.           |
 | year-month interval         | `19`    | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value.   |
 | day-time interval           | `20`    | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
 
@@ -368,6 +364,8 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 
 The year-month and day-time interval types have one byte at the beginning indicating the start and end fields. In the case of the year-month interval, the least significant bit denotes the start field and the next least significant bit denotes the end field. The remaining 6 bits are unused. A field value of 0 represents YEAR and 1 represents MONTH. In the case of the day-time interval, the least significant 2 bits denote the start field and the next least significant 2 bits denote the en [...]
 
+Type IDs 17 and 18 were originally reserved for a prototype feature (string-from-metadata) that was never implemented. These IDs are available for use by new types.
+
 [1] The parquet format does not have pure equivalents for the year-month and day-time interval types. Year-month intervals are usually represented using int32 values and the day-time intervals are usually represented using int64 values. However, these values don't include the start and end fields of these types. Therefore, Spark stores them in the column metadata.
 
 # Field ID order and uniqueness
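
For readers following the interval rows in the diff above, here is a minimal Python sketch of how the leading field byte could be unpacked. It is not Spark's implementation: the function names are hypothetical, and it assumes the input is the payload that follows the type byte. Only the YEAR/MONTH mapping is stated in the quoted README text; the day-time field values are truncated in this email, so the sketch returns them as raw integers.

```python
import struct

# Field codes for the year-month interval, as stated in the README text:
# 0 represents YEAR and 1 represents MONTH.
YEAR_MONTH_FIELDS = {0: "YEAR", 1: "MONTH"}


def decode_year_month_interval(payload: bytes):
    """Decode a year-month interval payload (hypothetical helper).

    Assumed layout: 1 field byte followed by a 4-byte little-endian
    signed integer, per the "Primitive Types" table row for type ID 19.
    """
    if len(payload) != 5:
        raise ValueError("expected 1 field byte plus a 4-byte value")
    fields = payload[0]
    start = fields & 0b1         # least significant bit: start field
    end = (fields >> 1) & 0b1    # next bit: end field; upper 6 bits unused
    (value,) = struct.unpack("<i", payload[1:5])
    return YEAR_MONTH_FIELDS[start], YEAR_MONTH_FIELDS[end], value


def split_day_time_fields(fields: int):
    """Split the day-time field byte into raw (start, end) codes.

    The low 2 bits are the start field and the next 2 bits the end field.
    The meaning of each code is not shown in this email, so none is assumed.
    """
    return fields & 0b11, (fields >> 2) & 0b11
```

For example, a field byte of `0b10` followed by the little-endian encoding of 14 decodes to a YEAR start field, a MONTH end field, and the value 14.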

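The `string` row that remains in the table (type ID 16) describes a 4-byte little-endian size followed by UTF-8 encoded bytes. The sketch below illustrates only that payload, under the assumption that the type byte has already been consumed; the helper names are hypothetical and are not part of Spark's Variant code.

```python
import struct


def encode_string_payload(text: str) -> bytes:
    """Encode text as a 4-byte little-endian size followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def decode_string_payload(buf: bytes) -> str:
    """Decode a payload produced by encode_string_payload."""
    if len(buf) < 4:
        raise ValueError("payload is shorter than the 4-byte size prefix")
    (size,) = struct.unpack_from("<I", buf, 0)
    data = buf[4:4 + size]
    if len(data) != size:
        raise ValueError("payload is shorter than its declared size")
    return data.decode("utf-8")
```

The size counts encoded bytes, not characters, so a string containing non-ASCII characters has a size larger than its character count.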

