harshmotw-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1690587491
##########
common/variant/README.md:
##########
@@ -335,27 +335,29 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |
-| Primitive Type | Type ID | Equivalent Parquet Type | Binary format |
-|-----------------------------|---------|---------------------------|-----------------------------------------------------------------------------------------------------------|
-| null | `0` | any | none |
-| boolean (True) | `1` | BOOLEAN | none |
-| boolean (False) | `2` | BOOLEAN | none |
-| int8 | `3` | INT(8, signed) | 1 byte |
-| int16 | `4` | INT(16, signed) | 2 byte little-endian |
-| int32 | `5` | INT(32, signed) | 4 byte little-endian |
-| int64 | `6` | INT(64, signed) | 8 byte little-endian |
-| double | `7` | DOUBLE | IEEE little-endian |
-| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| date | `11` | DATE | 4 byte little-endian |
-| timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
-| timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
-| float | `14` | FLOAT | IEEE little-endian |
-| binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
-| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
-| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
+| Primitive Type | Type ID | Equivalent Parquet Type | Binary format |
+|-----------------------------|---------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
+| null | `0` | any | none |
+| boolean (True) | `1` | BOOLEAN | none |
+| boolean (False) | `2` | BOOLEAN | none |
+| int8 | `3` | INT(8, signed) | 1 byte |
+| int16 | `4` | INT(16, signed) | 2 byte little-endian |
+| int32 | `5` | INT(32, signed) | 4 byte little-endian |
+| int64 | `6` | INT(64, signed) | 8 byte little-endian |
+| double | `7` | DOUBLE | IEEE little-endian |
+| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| date | `11` | DATE | 4 byte little-endian |
+| timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
+| timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
+| float | `14` | FLOAT | IEEE little-endian |
+| binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
+| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
+| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
+| year-month interval | `19` | YearMonthIntervalType(start_field, end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
Review Comment:
I ran the following Python session against a Parquet file containing these
interval types. The intervals are physically stored as plain integers (int32
for year-month, int64 for day-time), and the interval type information is kept
in the Spark schema metadata. I'll update the table to reflect this.
```
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('/home/harsh.motwani/tables/part-00000-tid-8067172485220669242-1687c1be-9e28-455a-817a-449a862b4a05-0-1-c000.snappy.parquet')
>>> table.schema
ymi0: int32 not null
ymi1: int32 not null
ymi2: int32
dti0: int64
dti1: int64
-- schema metadata --
org.apache.spark.version: '4.0.0'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 375
>>> table.schema.metadata
OrderedDict([(b'org.apache.spark.version', b'4.0.0'),
(b'org.apache.spark.sql.parquet.row.metadata',
b'{"type":"struct","fields":[{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},{"name":"ymi1","type":"interval year","nullable":false,"metadata":{}},{"name":"ymi2","type":"interval month","nullable":true,"metadata":{}},{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}},{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}')])
```
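For anyone who wants to check this without reading the raw metadata dump, here is a
minimal sketch of recovering the logical interval type per column from the footer
metadata. The helper name `spark_field_types` and the trimmed `example_metadata` are
hypothetical and only for illustration; they are not part of Spark or pyarrow. The
sketch assumes the metadata is a dict with bytes keys and values, as
`table.schema.metadata` returned above.

```python
import json

# Key under which Spark writes its own schema into the Parquet footer
# (taken from the session output above).
SPARK_ROW_METADATA_KEY = b"org.apache.spark.sql.parquet.row.metadata"


def spark_field_types(schema_metadata):
    """Map column name -> Spark type as recorded in the footer metadata.

    Hypothetical helper: for the flat interval columns above the type is a
    plain string; nested types would show up as JSON objects instead.
    """
    spark_schema = json.loads(schema_metadata[SPARK_ROW_METADATA_KEY])
    return {field["name"]: field["type"] for field in spark_schema["fields"]}


# Metadata copied from the session above, trimmed to two columns.
example_metadata = {
    b"org.apache.spark.version": b"4.0.0",
    SPARK_ROW_METADATA_KEY: (
        b'{"type":"struct","fields":['
        b'{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},'
        b'{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}'
    ),
}

field_types = spark_field_types(example_metadata)
print(field_types)
```

With a real file, passing `table.schema.metadata` in place of `example_metadata`
should give the same kind of mapping, which can then be compared against the
int32/int64 physical types in `table.schema`.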
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]