gene-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1690470525
##########
common/utils/src/main/resources/error/error-conditions.json:
##########
@@ -4186,6 +4186,12 @@
],
"sqlState" : "42846"
},
+ "UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT" : {
+ "message" : [
+ "Unknown primitive type with id <id> was found in a variant value. The
type might be supported in a newer version."
Review Comment:
I think we can remove the second part.
```suggestion
"Unknown primitive type with id <id> was found in a variant value."
```
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -214,6 +214,22 @@ public void appendTimestampNtz(long microsSinceEpoch) {
writePos += 8;
}
+ public void appendYearMonthInterval(long value, byte startField, byte
endField) {
+ checkCapacity(1 + 5);
+ writeBuffer[writePos++] = primitiveHeader(YEAR_MONTH_INTERVAL);
+ writeBuffer[writePos++] = (byte) (startField | (endField << 1));
Review Comment:
Should we mask `startField` and `endField` to make sure we are only using
the desired bits?
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##########
@@ -377,11 +402,51 @@ public static long getLong(byte[] value, int pos) {
case TIMESTAMP:
case TIMESTAMP_NTZ:
return readLong(value, pos + 1, 8);
+ case YEAR_MONTH_INTERVAL:
Review Comment:
Oh, I see how you updated `getLong`. Could you update the function comments
to include details about the new behavior?
##########
common/variant/README.md:
##########
@@ -335,27 +335,29 @@ The Decimal type contains a scale, but no precision. The
implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |
-| Primitive Type | Type ID | Equivalent Parquet Type | Binary
format
|
-|-----------------------------|---------|---------------------------|-----------------------------------------------------------------------------------------------------------|
-| null | `0` | any | none
|
-| boolean (True) | `1` | BOOLEAN | none
|
-| boolean (False) | `2` | BOOLEAN | none
|
-| int8 | `3` | INT(8, signed) | 1 byte
|
-| int16 | `4` | INT(16, signed) | 2 byte
little-endian
|
-| int32 | `5` | INT(32, signed) | 4 byte
little-endian
|
-| int64 | `6` | INT(64, signed) | 8 byte
little-endian
|
-| double | `7` | DOUBLE | IEEE
little-endian
|
-| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte
scale in range [0, 38], followed by little-endian unscaled value (see decimal
table) |
-| decimal8 | `9` | DECIMAL(precision, scale) | 1 byte
scale in range [0, 38], followed by little-endian unscaled value (see decimal
table) |
-| decimal16 | `10` | DECIMAL(precision, scale) | 1 byte
scale in range [0, 38], followed by little-endian unscaled value (see decimal
table) |
-| date | `11` | DATE | 4 byte
little-endian
|
-| timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte
little-endian
|
-| timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte
little-endian
|
-| float | `14` | FLOAT | IEEE
little-endian
|
-| binary | `15` | BINARY | 4 byte
little-endian size, followed by bytes
|
-| string | `16` | STRING | 4 byte
little-endian size, followed by UTF-8 encoded bytes
|
-| binary from metadata | `17` | BINARY |
Little-endian index into the metadata dictionary. Number of bytes is equal to
the metadata `offset_size`. |
-| string from metadata | `18` | STRING |
Little-endian index into the metadata dictionary. Number of bytes is equal to
the metadata `offset_size`. |
+| Primitive Type | Type ID | Equivalent Parquet Type
| Binary format
|
+|-----------------------------|---------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
+| null | `0` | any
| none
|
+| boolean (True) | `1` | BOOLEAN
| none
|
+| boolean (False) | `2` | BOOLEAN
| none
|
+| int8 | `3` | INT(8, signed)
| 1 byte
|
+| int16 | `4` | INT(16, signed)
| 2 byte little-endian
|
+| int32 | `5` | INT(32, signed)
| 4 byte little-endian
|
+| int64 | `6` | INT(64, signed)
| 8 byte little-endian
|
+| double | `7` | DOUBLE
| IEEE little-endian
|
+| decimal4 | `8` | DECIMAL(precision, scale)
| 1 byte scale in range [0, 38], followed by little-endian unscaled
value (see decimal table) |
+| decimal8 | `9` | DECIMAL(precision, scale)
| 1 byte scale in range [0, 38], followed by little-endian unscaled
value (see decimal table) |
+| decimal16 | `10` | DECIMAL(precision, scale)
| 1 byte scale in range [0, 38], followed by little-endian unscaled
value (see decimal table) |
+| date | `11` | DATE
| 4 byte little-endian
|
+| timestamp | `12` | TIMESTAMP(true, MICROS)
| 8-byte little-endian
|
+| timestamp without time zone | `13` | TIMESTAMP(false, MICROS)
| 8-byte little-endian
|
+| float | `14` | FLOAT
| IEEE little-endian
|
+| binary | `15` | BINARY
| 4 byte little-endian size, followed by bytes
|
+| string | `16` | STRING
| 4 byte little-endian size, followed by UTF-8 encoded bytes
|
+| binary from metadata | `17` | BINARY
| Little-endian index into the metadata dictionary. Number of bytes is
equal to the metadata `offset_size`. |
+| string from metadata | `18` | STRING
| Little-endian index into the metadata dictionary. Number of bytes is
equal to the metadata `offset_size`. |
+| year-month interval | `19` | YearMonthIntervalType(start_field,
end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting
at LSB followed by 4-byte little-endian value. |
Review Comment:
What would the parquet types be for these intervals?
Also, we need to describe what the start/end fields are and how exactly they
are encoded in the byte.
##########
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##########
@@ -316,6 +334,18 @@ static void toJsonImpl(byte[] value, byte[] metadata, int
pos, StringBuilder sb,
case BINARY:
appendQuoted(sb,
Base64.getEncoder().encodeToString(VariantUtil.getBinary(value, pos)));
break;
+ case YEAR_MONTH_INTERVAL:
+ IntervalFields ymFields =
VariantUtil.getYearMonthIntervalFields(value, pos);
+ int ymValue = (int) VariantUtil.getLong(value, pos);
Review Comment:
I'm not sure how this works. I thought the first byte at `pos` would include
the start and end info. Then, I thought the remaining would be a 4 byte int? My
confusion is:
- Shouldn't the `pos` read be `pos + 1`?
- Doesn't `VariantUtil.getLong(value, pos);` read variable length
variant-encoded int? I thought we just need to read a 4-byte int, since the 4
bytes won't have any variant type info?
(similar question for `DAY_TIME_INTERVAL` below)
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##########
@@ -377,11 +402,51 @@ public static long getLong(byte[] value, int pos) {
case TIMESTAMP:
case TIMESTAMP_NTZ:
return readLong(value, pos + 1, 8);
+ case YEAR_MONTH_INTERVAL:
+ return readLong(value, pos + 2, 4);
+ case DAY_TIME_INTERVAL:
+ return readLong(value, pos + 2, 8);
default:
throw new IllegalStateException(exceptionMessage);
}
}
+ public static class IntervalFields {
Review Comment:
Can you comment on this class, and what it represents?
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -214,6 +214,22 @@ public void appendTimestampNtz(long microsSinceEpoch) {
writePos += 8;
}
+ public void appendYearMonthInterval(long value, byte startField, byte
endField) {
+ checkCapacity(1 + 5);
+ writeBuffer[writePos++] = primitiveHeader(YEAR_MONTH_INTERVAL);
+ writeBuffer[writePos++] = (byte) (startField | (endField << 1));
+ writeLong(writeBuffer, writePos, value, 4);
+ writePos += 4;
+ }
+
+ public void appendDayTimeInterval(long value, byte startField, byte
endField) {
+ checkCapacity(1 + 9);
+ writeBuffer[writePos++] = primitiveHeader(DAY_TIME_INTERVAL);
+ writeBuffer[writePos++] = (byte) (startField | (endField << 2));
Review Comment:
Should we mask `startField` and `endField` to make sure we are only using
the desired bits?
Or, we could do checks on the inputs?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]