harshmotw-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1690587491


##########
common/variant/README.md:
##########
@@ -335,27 +335,29 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object       | `2` | A collection of (string-key, variant-value) pairs |
 | Array        | `3` | An ordered sequence of variant values             |
 
-| Primitive Type              | Type ID | Equivalent Parquet Type   | Binary format                                                                                             |
-|-----------------------------|---------|---------------------------|-----------------------------------------------------------------------------------------------------------|
-| null                        | `0`     | any                       | none                                                                                                      |
-| boolean (True)              | `1`     | BOOLEAN                   | none                                                                                                      |
-| boolean (False)             | `2`     | BOOLEAN                   | none                                                                                                      |
-| int8                        | `3`     | INT(8, signed)            | 1 byte                                                                                                    |
-| int16                       | `4`     | INT(16, signed)           | 2 byte little-endian                                                                                      |
-| int32                       | `5`     | INT(32, signed)           | 4 byte little-endian                                                                                      |
-| int64                       | `6`     | INT(64, signed)           | 8 byte little-endian                                                                                      |
-| double                      | `7`     | DOUBLE                    | IEEE little-endian                                                                                        |
-| decimal4                    | `8`     | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)               |
-| decimal8                    | `9`     | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)               |
-| decimal16                   | `10`    | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)               |
-| date                        | `11`    | DATE                      | 4 byte little-endian                                                                                      |
-| timestamp                   | `12`    | TIMESTAMP(true, MICROS)   | 8-byte little-endian                                                                                      |
-| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)  | 8-byte little-endian                                                                                      |
-| float                       | `14`    | FLOAT                     | IEEE little-endian                                                                                        |
-| binary                      | `15`    | BINARY                    | 4 byte little-endian size, followed by bytes                                                              |
-| string                      | `16`    | STRING                    | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                |
-| binary from metadata        | `17`    | BINARY                    | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
-| string from metadata        | `18`    | STRING                    | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
+| Primitive Type              | Type ID | Equivalent Parquet Type                       | Binary format                                                                                                       |
+|-----------------------------|---------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
+| null                        | `0`     | any                                           | none                                                                                                                |
+| boolean (True)              | `1`     | BOOLEAN                                       | none                                                                                                                |
+| boolean (False)             | `2`     | BOOLEAN                                       | none                                                                                                                |
+| int8                        | `3`     | INT(8, signed)                                | 1 byte                                                                                                              |
+| int16                       | `4`     | INT(16, signed)                               | 2 byte little-endian                                                                                                |
+| int32                       | `5`     | INT(32, signed)                               | 4 byte little-endian                                                                                                |
+| int64                       | `6`     | INT(64, signed)                               | 8 byte little-endian                                                                                                |
+| double                      | `7`     | DOUBLE                                        | IEEE little-endian                                                                                                  |
+| decimal4                    | `8`     | DECIMAL(precision, scale)                     | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| decimal8                    | `9`     | DECIMAL(precision, scale)                     | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| decimal16                   | `10`    | DECIMAL(precision, scale)                     | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table)                         |
+| date                        | `11`    | DATE                                          | 4 byte little-endian                                                                                                |
+| timestamp                   | `12`    | TIMESTAMP(true, MICROS)                       | 8-byte little-endian                                                                                                |
+| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)                      | 8-byte little-endian                                                                                                |
+| float                       | `14`    | FLOAT                                         | IEEE little-endian                                                                                                  |
+| binary                      | `15`    | BINARY                                        | 4 byte little-endian size, followed by bytes                                                                        |
+| string                      | `16`    | STRING                                        | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
+| binary from metadata        | `17`    | BINARY                                        | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`.           |
+| string from metadata        | `18`    | STRING                                        | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`.           |
+| year-month interval         | `19`    | YearMonthIntervalType(start_field, end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value.   |

Review Comment:
   I ran the following Python script on a Parquet table containing these interval types. The intervals are physically stored as int32 (year-month) and int64 (day-time), and the interval type information is stored in the Spark schema metadata. I'll update the table to reflect this.
   
   ```
   >>> import pyarrow.parquet as pq
   >>> table = pq.read_table('/home/harsh.motwani/tables/part-00000-tid-8067172485220669242-1687c1be-9e28-455a-817a-449a862b4a05-0-1-c000.snappy.parquet')
   >>> table.schema
   ymi0: int32 not null
   ymi1: int32 not null
   ymi2: int32
   dti0: int64
   dti1: int64
   -- schema metadata --
   org.apache.spark.version: '4.0.0'
   org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 375
   
   >>> table.schema.metadata
   OrderedDict([(b'org.apache.spark.version', b'4.0.0'), (b'org.apache.spark.sql.parquet.row.metadata', b'{"type":"struct","fields":[{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},{"name":"ymi1","type":"interval year","nullable":false,"metadata":{}},{"name":"ymi2","type":"interval month","nullable":true,"metadata":{}},{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}},{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}')])
   ```
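
For illustration, the interval type of each column can be recovered from that metadata by parsing the JSON stored under `org.apache.spark.sql.parquet.row.metadata`. This is only a minimal sketch, assuming the metadata has the shape shown in the output above and that every field type is a plain string (nested types would appear as JSON objects instead). The helper name `spark_logical_types` and the trimmed `example_metadata` are hypothetical and not part of Spark or pyarrow.

```python
import json

# Key under which Spark stores its own schema in the Parquet footer
# (taken from the output above).
SPARK_ROW_METADATA_KEY = b"org.apache.spark.sql.parquet.row.metadata"


def spark_logical_types(schema_metadata):
    """Map column name -> Spark type string from Parquet schema metadata.

    `schema_metadata` is a bytes-to-bytes mapping such as
    `table.schema.metadata` in the session above (hypothetical helper).
    """
    spark_schema = json.loads(schema_metadata[SPARK_ROW_METADATA_KEY])
    return {field["name"]: field["type"] for field in spark_schema["fields"]}


# Metadata copied from the output above, trimmed to two columns.
example_metadata = {
    b"org.apache.spark.version": b"4.0.0",
    SPARK_ROW_METADATA_KEY: (
        b'{"type":"struct","fields":['
        b'{"name":"ymi0","type":"interval year to month",'
        b'"nullable":false,"metadata":{}},'
        b'{"name":"dti0","type":"interval day to second",'
        b'"nullable":true,"metadata":{}}]}'
    ),
}

print(spark_logical_types(example_metadata))
```

With a real file, the same function could be called on `table.schema.metadata` directly, since that mapping uses the same byte-string keys.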



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

