(parquet-format) branch master updated: GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)

emkornfield Tue, 10 Dec 2024 14:48:43 -0800

This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new a3dda6a  GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)
a3dda6a is described below

commit a3dda6ac6691b33525b75f230a320f98d4027f86
Author: Aihua Xu <[email protected]>
AuthorDate: Tue Dec 10 14:48:32 2024 -0800

    GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)
    
    * Add more types - time, nano timestamps, UUID to Variant.
    
    * Update type names to align with Parquet logical type
    
    * Update logical type
    
    * Update VariantEncoding.md
    
    Co-authored-by: emkornfield <[email protected]>
    
    * Update VariantEncoding.md
    
    Co-authored-by: emkornfield <[email protected]>
    
    ---------
    
    Co-authored-by: emkornfield <[email protected]>
---
 VariantEncoding.md | 55 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index 53a2a68..2930c71 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
 The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
 
 # Encoding types
+*Variant basic types*
 
 | Basic Type   | ID  | Description                                       |
 |--------------|-----|---------------------------------------------------|
@@ -373,25 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object       | `2` | A collection of (string-key, variant-value) pairs |
 | Array        | `3` | An ordered sequence of variant values             |
 
-| Logical Type         | Physical Type               | Type ID | Equivalent Parquet Type     | Binary format                                                                               |
-|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
-| NullType             | null                        | `0`     | any                         | none                                                                                        |
-| Boolean              | boolean (True)              | `1`     | BOOLEAN                     | none                                                                                        |
-| Boolean              | boolean (False)             | `2`     | BOOLEAN                     | none                                                                                        |
-| Exact Numeric        | int8                        | `3`     | INT(8, signed)              | 1 byte                                                                                      |
-| Exact Numeric        | int16                       | `4`     | INT(16, signed)             | 2 byte little-endian                                                                        |
-| Exact Numeric        | int32                       | `5`     | INT(32, signed)             | 4 byte little-endian                                                                        |
-| Exact Numeric        | int64                       | `6`     | INT(64, signed)             | 8 byte little-endian                                                                        |
-| Double               | double                      | `7`     | DOUBLE                      | IEEE little-endian                                                                          |
-| Exact Numeric        | decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric        | decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric        | decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Date                 | date                        | `11`    | DATE                        | 4 byte little-endian                                                                        |
-| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian                                                                        |
-| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian                                                                        |
-| Float                | float                       | `14`    | FLOAT                       | IEEE little-endian                                                                          |
-| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                |
-| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                  |
+*Variant primitive types*
+
+| Type Equivalence Class         | Physical Type               | Type ID | Equivalent Parquet Type     | Binary format                                                                               |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
+| NullType             | null                        | `0`     | any                         | none                                                                                        |
+| Boolean              | boolean (True)              | `1`     | BOOLEAN                     | none                                                                                        |
+| Boolean              | boolean (False)             | `2`     | BOOLEAN                     | none                                                                                        |
+| Exact Numeric        | int8                        | `3`     | INT(8, signed)              | 1 byte                                                                                      |
+| Exact Numeric        | int16                       | `4`     | INT(16, signed)             | 2 byte little-endian                                                                        |
+| Exact Numeric        | int32                       | `5`     | INT(32, signed)             | 4 byte little-endian                                                                        |
+| Exact Numeric        | int64                       | `6`     | INT(64, signed)             | 8 byte little-endian                                                                        |
+| Double               | double                      | `7`     | DOUBLE                      | IEEE little-endian                                                                          |
+| Exact Numeric        | decimal4                    | `8`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric        | decimal8                    | `9`     | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric        | decimal16                   | `10`    | DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Date                 | date                        | `11`    | DATE                        | 4 byte little-endian                                                                        |
+| Timestamp            | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian                                                        |
+| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(isAdjustedToUTC=false, MICROS)    | 8-byte little-endian                                                        |
+| Float                | float                       | `14`    | FLOAT                       | IEEE little-endian                                                                          |
+| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                |
+| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                  |
+| TimeNTZ              | time without time zone      | `21`    | TIME(isAdjustedToUTC=false, MICROS)          | 8-byte little-endian                                                       |
+| Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                       |
+| TimestampNTZ         | timestamp without time zone | `23`    | TIMESTAMP(isAdjustedToUTC=false, NANOS)      | 8-byte little-endian                                                       |
+| UUID                 | uuid                        | `24`    | UUID                        | 16-byte big-endian                                                                          |
+
+The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
+For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
+Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
+*Decimal table*
 
 | Decimal Precision     | Decimal value type |
 |-----------------------|--------------------|
@@ -400,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | 18 <= precision <= 38 | int128             |
 | > 38                  | Not supported      |
 
-The *Logical Type* column indicates logical equivalence of physically encoded types.
-For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
-Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
-
 # String values must be UTF-8 encoded
 
 All strings within the Variant binary format must be UTF-8 encoded.
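
As a rough illustration of the binary formats listed for the newly added type IDs (21 to 24), the following Python sketch decodes a value payload by type ID. It is not part of the commit. The function name `decode_primitive` is hypothetical, and the sketch assumes the payload has already been separated from the value header. It also assumes the Parquet conventions for the equivalent types: TIME counts microseconds since midnight, and TIMESTAMP counts nanoseconds since the Unix epoch.

```python
import struct
import uuid
from datetime import datetime, timedelta


def decode_primitive(type_id: int, payload: bytes):
    """Decode the payload of one of the new Variant primitive types.

    Hypothetical helper: `payload` is assumed to hold only the value bytes,
    with the value header already stripped by the caller.
    """
    if type_id == 21:
        # TimeNTZ: 8-byte little-endian, assumed microseconds since midnight.
        (micros,) = struct.unpack("<q", payload[:8])
        return (datetime.min + timedelta(microseconds=micros)).time()
    if type_id in (22, 23):
        # Nanosecond timestamps: 8-byte little-endian, assumed nanoseconds
        # since the Unix epoch. Python's datetime has only microsecond
        # resolution, so the raw count is returned together with the
        # isAdjustedToUTC flag (true for ID 22, false for ID 23).
        (nanos,) = struct.unpack("<q", payload[:8])
        return nanos, type_id == 22
    if type_id == 24:
        # UUID: 16 bytes in big-endian order, which is the layout that
        # uuid.UUID(bytes=...) expects.
        return uuid.UUID(bytes=payload[:16])
    raise ValueError(f"type ID {type_id} is not handled by this sketch")
```

Returning the raw nanosecond count avoids silently truncating to microseconds; a caller that needs a calendar value would have to choose its own nanosecond-capable representation.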
