This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new c5c880e690c3 [SPARK-49591][SQL] Add Logical Type column to variant readme
c5c880e690c3 is described below

commit c5c880e690c38b2bb597b7a38f20b32e2e2d272c
Author: cashmand <[email protected]>
AuthorDate: Thu Sep 12 22:35:57 2024 +0800

    [SPARK-49591][SQL] Add Logical Type column to variant readme
    
    ### What changes were proposed in this pull request?
    
    Add a concept of logical type to the variant README.md, distinct from the physical encoding of a value. In particular, decimal and integer values are considered to be members of a single "Exact Numeric" type.
    
    ### Why are the changes needed?
    
    This is intended to describe and justify the existing Spark behaviour for Variant (e.g. stripping trailing zeros for decimal to string casts), not change it. (Although the SchemaOfVariant expression does not strictly follow this right now for numeric types, and should be updated to match it.) The motivation for introducing a single numeric type that encompasses integer and decimal values is to allow more flexibility in storage (particularly once shredding is introduced), and provide a [...]
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It is a documentation change.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #48064 from cashmand/cashmand/SPARK-49591.
    
    Authored-by: cashmand <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 common/variant/README.md | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/common/variant/README.md b/common/variant/README.md
index a66d708da75b..4ed7c16f5b6e 100644
--- a/common/variant/README.md
+++ b/common/variant/README.md
@@ -333,27 +333,27 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object       | `2` | A collection of (string-key, variant-value) pairs |
 | Array        | `3` | An ordered sequence of variant values             |
 
-| Primitive Type              | Type ID | Equivalent Parquet Type     | Binary format                                                                                                       |
-|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
-| null                        | `0`     | any                         | none                                                                                                                |
-| boolean (True)              | `1`     | BOOLEAN                     | none                                                                                                                |
-| boolean (False)             | `2`     | BOOLEAN                     | none                                                                                                                |
-| int8                        | `3`     | INT(8, signed)              | 1 byte                                                                                                              |
-| int16                       | `4`     | INT(16, signed)             | 2 byte little-endian                                                                                                |
-| int32                       | `5`     | INT(32, signed)             | 4 byte little-endian                                                                                                |
-| int64                       | `6`     | INT(64, signed)             | 8 byte little-endian                                                                                                |
-| double                      | `7`     | DOUBLE                      | IEEE little-endian                                                                                                  |
-| decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
-| decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
-| decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
-| date                        | `11`    | DATE                        | 4 byte little-endian                                                                                                |
-| timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian                                                                                                |
-| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian                                                                                                |
-| float                       | `14`    | FLOAT                       | IEEE little-endian                                                                                                  |
-| binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
-| string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
-| year-month interval         | `19`    | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value.   |
-| day-time interval           | `20`    | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
+| Logical Type         | Physical Type               | Type ID | Equivalent Parquet Type     | Binary format                                                                                                       |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
+| NullType             | null                        | `0`     | any                         | none                                                                                                                |
+| Boolean              | boolean (True)              | `1`     | BOOLEAN                     | none                                                                                                                |
+| Boolean              | boolean (False)             | `2`     | BOOLEAN                     | none                                                                                                                |
+| Exact Numeric        | int8                        | `3`     | INT(8, signed)              | 1 byte                                                                                                              |
+| Exact Numeric        | int16                       | `4`     | INT(16, signed)             | 2 byte little-endian                                                                                                |
+| Exact Numeric        | int32                       | `5`     | INT(32, signed)             | 4 byte little-endian                                                                                                |
+| Exact Numeric        | int64                       | `6`     | INT(64, signed)             | 8 byte little-endian                                                                                                |
+| Double               | double                      | `7`     | DOUBLE                      | IEEE little-endian                                                                                                  |
+| Exact Numeric        | decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| Exact Numeric        | decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| Exact Numeric        | decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| Date                 | date                        | `11`    | DATE                        | 4 byte little-endian                                                                                                |
+| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian                                                                                                |
+| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian                                                                                                |
+| Float                | float                       | `14`    | FLOAT                       | IEEE little-endian                                                                                                  |
+| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
+| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
+| YMInterval           | year-month interval         | `19`    | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value.   |
+| DTInterval           | day-time interval           | `20`    | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |
 
 | Decimal Precision     | Decimal value type |
 |-----------------------|--------------------|
@@ -362,6 +362,8 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | 18 <= precision <= 38 | int128             |
 | > 38                  | Not supported      |
 
+The *Logical Type* column indicates logical equivalence of physically encoded types. For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding. Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
 The year-month and day-time interval types have one byte at the beginning indicating the start and end fields. In the case of the year-month interval, the least significant bit denotes the start field and the next least significant bit denotes the end field. The remaining 6 bits are unused. A field value of 0 represents YEAR and 1 represents MONTH. In the case of the day-time interval, the least significant 2 bits denote the start field and the next least significant 2 bits denote the en [...]
 
 Type IDs 17 and 18 were originally reserved for a prototype feature (string-from-metadata) that was never implemented. These IDs are available for use by new types.
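
The logical-equivalence rule added in the hunk above (an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100) can be illustrated with a minimal Python sketch. It is not Spark's implementation: the helper names `exact_numeric` and `to_display_string` are hypothetical, and the string rendering only mimics the "strip trailing zeros" behaviour mentioned in the commit message.

```python
from decimal import Decimal


def exact_numeric(unscaled: int, scale: int = 0) -> Decimal:
    """Build the exact logical value unscaled * 10**(-scale).

    Hypothetical helper: integers are treated as the scale-0 case, and the
    decimal4/8/16 encodings carry a scale in [0, 38].
    """
    if not 0 <= scale <= 38:
        raise ValueError("scale must be in [0, 38]")
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # The tuple constructor is exact and is not rounded to the context precision.
    return Decimal((sign, digits, -scale))


def to_display_string(value: Decimal) -> str:
    """Render a value without trailing fractional zeros (illustrative only)."""
    text = format(value, "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


int8_one = exact_numeric(1)            # physical int8, value 1
decimal16_one = exact_numeric(100, 2)  # physical decimal16, scale 2, unscaled 100

# Both physical encodings denote the same Exact Numeric value.
print(int8_one == decimal16_one)
print(to_display_string(int8_one), to_display_string(decimal16_one))
```

Comparing on the logical value rather than on the (type ID, bytes) pair is what lets a writer pick whichever physical encoding is convenient without changing what user expressions observe.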

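The interval header byte described in the README context above can be sketched in the same way. This assumes the payload passed in starts at the interval's header byte (that is, after whatever carries the type ID) and that the 4-byte value is signed, matching the `INT(32, signed)` column. The function names are hypothetical. The day-time field codes are truncated in this message, so the second helper returns the raw 2-bit codes instead of guessing their names.

```python
import struct

# Field codes for the year-month interval, as stated in the README text.
YEAR_MONTH_FIELDS = {0: "YEAR", 1: "MONTH"}


def decode_year_month_interval(payload: bytes):
    """Decode 1 header byte followed by a 4-byte little-endian value."""
    if len(payload) != 5:
        raise ValueError("expected 1 header byte followed by 4 value bytes")
    header = payload[0]
    start_field = YEAR_MONTH_FIELDS[header & 0b1]       # least significant bit
    end_field = YEAR_MONTH_FIELDS[(header >> 1) & 0b1]  # next least significant bit
    (value,) = struct.unpack_from("<i", payload, 1)     # signed 32-bit, little-endian
    return start_field, end_field, value


def split_day_time_header(header: int):
    """Split a day-time header byte into raw (start, end) 2-bit field codes."""
    start_code = header & 0b11        # least significant 2 bits
    end_code = (header >> 2) & 0b11   # next least significant 2 bits
    return start_code, end_code


# Header 0b10: start field YEAR (bit 0 = 0), end field MONTH (bit 1 = 1).
example = bytes([0b10]) + struct.pack("<i", 14)
print(decode_year_month_interval(example))
print(split_day_time_header(0b1101))
```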
