Re: [PR] [SPARK-45891][SQL][PySpark][VARIANT] Add support for interval types in the Variant Spec [spark]

via GitHub Wed, 24 Jul 2024 14:56:27 -0700


gene-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1690470525



##########
common/utils/src/main/resources/error/error-conditions.json:
##########
@@ -4186,6 +4186,12 @@
     ],
     "sqlState" : "42846"
   },
+  "UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT" : {
+    "message" : [
+      "Unknown primitive type with id <id> was found in a variant value. The 
type might be supported in a newer version."

Review Comment:
   I think we can remove the second part.
   ```suggestion
         "Unknown primitive type with id <id> was found in a variant value."
   ```



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -214,6 +214,22 @@ public void appendTimestampNtz(long microsSinceEpoch) {
     writePos += 8;
   }
 
+  public void appendYearMonthInterval(long value, byte startField, byte 
endField) {
+    checkCapacity(1 + 5);
+    writeBuffer[writePos++] = primitiveHeader(YEAR_MONTH_INTERVAL);
+    writeBuffer[writePos++] = (byte) (startField | (endField << 1));

Review Comment:
   Should we mask `startField` and `endField` to make sure we are only using 
the desired bits?



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##########
@@ -377,11 +402,51 @@ public static long getLong(byte[] value, int pos) {
       case TIMESTAMP:
       case TIMESTAMP_NTZ:
         return readLong(value, pos + 1, 8);
+      case YEAR_MONTH_INTERVAL:

Review Comment:
   Oh, I see how you updated `getLong`. Could you update the function comments 
to include details about the new behavior?



##########
common/variant/README.md:
##########
@@ -335,27 +335,29 @@ The Decimal type contains a scale, but no precision. The 
implied precision of a
 | Object       | `2` | A collection of (string-key, variant-value) pairs |
 | Array        | `3` | An ordered sequence of variant values             |
 
-| Primitive Type              | Type ID | Equivalent Parquet Type   | Binary 
format                                                                          
                   |
-|-----------------------------|---------|---------------------------|-----------------------------------------------------------------------------------------------------------|
-| null                        | `0`     | any                       | none     
                                                                                
                 |
-| boolean (True)              | `1`     | BOOLEAN                   | none     
                                                                                
                 |
-| boolean (False)             | `2`     | BOOLEAN                   | none     
                                                                                
                 |
-| int8                        | `3`     | INT(8, signed)            | 1 byte   
                                                                                
                 |
-| int16                       | `4`     | INT(16, signed)           | 2 byte 
little-endian                                                                   
                   |
-| int32                       | `5`     | INT(32, signed)           | 4 byte 
little-endian                                                                   
                   |
-| int64                       | `6`     | INT(64, signed)           | 8 byte 
little-endian                                                                   
                   |
-| double                      | `7`     | DOUBLE                    | IEEE 
little-endian                                                                   
                     |
-| decimal4                    | `8`     | DECIMAL(precision, scale) | 1 byte 
scale in range [0, 38], followed by little-endian unscaled value (see decimal 
table)               |
-| decimal8                    | `9`     | DECIMAL(precision, scale) | 1 byte 
scale in range [0, 38], followed by little-endian unscaled value (see decimal 
table)               |
-| decimal16                   | `10`    | DECIMAL(precision, scale) | 1 byte 
scale in range [0, 38], followed by little-endian unscaled value (see decimal 
table)               |
-| date                        | `11`    | DATE                      | 4 byte 
little-endian                                                                   
                   |
-| timestamp                   | `12`    | TIMESTAMP(true, MICROS)   | 8-byte 
little-endian                                                                   
                   |
-| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)  | 8-byte 
little-endian                                                                   
                   |
-| float                       | `14`    | FLOAT                     | IEEE 
little-endian                                                                   
                     |
-| binary                      | `15`    | BINARY                    | 4 byte 
little-endian size, followed by bytes                                           
                   |
-| string                      | `16`    | STRING                    | 4 byte 
little-endian size, followed by UTF-8 encoded bytes                             
                   |
-| binary from metadata        | `17`    | BINARY                    | 
Little-endian index into the metadata dictionary. Number of bytes is equal to 
the metadata `offset_size`. |
-| string from metadata        | `18`    | STRING                    | 
Little-endian index into the metadata dictionary. Number of bytes is equal to 
the metadata `offset_size`. |
+| Primitive Type              | Type ID | Equivalent Parquet Type              
         | Binary format                                                        
                                               |
+|-----------------------------|---------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
+| null                        | `0`     | any                                  
         | none                                                                 
                                               |
+| boolean (True)              | `1`     | BOOLEAN                              
         | none                                                                 
                                               |
+| boolean (False)             | `2`     | BOOLEAN                              
         | none                                                                 
                                               |
+| int8                        | `3`     | INT(8, signed)                       
         | 1 byte                                                               
                                               |
+| int16                       | `4`     | INT(16, signed)                      
         | 2 byte little-endian                                                 
                                               |
+| int32                       | `5`     | INT(32, signed)                      
         | 4 byte little-endian                                                 
                                               |
+| int64                       | `6`     | INT(64, signed)                      
         | 8 byte little-endian                                                 
                                               |
+| double                      | `7`     | DOUBLE                               
         | IEEE little-endian                                                   
                                               |
+| decimal4                    | `8`     | DECIMAL(precision, scale)            
         | 1 byte scale in range [0, 38], followed by little-endian unscaled 
value (see decimal table)                         |
+| decimal8                    | `9`     | DECIMAL(precision, scale)            
         | 1 byte scale in range [0, 38], followed by little-endian unscaled 
value (see decimal table)                         |
+| decimal16                   | `10`    | DECIMAL(precision, scale)            
         | 1 byte scale in range [0, 38], followed by little-endian unscaled 
value (see decimal table)                         |
+| date                        | `11`    | DATE                                 
         | 4 byte little-endian                                                 
                                               |
+| timestamp                   | `12`    | TIMESTAMP(true, MICROS)              
         | 8-byte little-endian                                                 
                                               |
+| timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)             
         | 8-byte little-endian                                                 
                                               |
+| float                       | `14`    | FLOAT                                
         | IEEE little-endian                                                   
                                               |
+| binary                      | `15`    | BINARY                               
         | 4 byte little-endian size, followed by bytes                         
                                               |
+| string                      | `16`    | STRING                               
         | 4 byte little-endian size, followed by UTF-8 encoded bytes           
                                               |
+| binary from metadata        | `17`    | BINARY                               
         | Little-endian index into the metadata dictionary. Number of bytes is 
equal to the metadata `offset_size`.           |
+| string from metadata        | `18`    | STRING                               
         | Little-endian index into the metadata dictionary. Number of bytes is 
equal to the metadata `offset_size`.           |
+| year-month interval         | `19`    | YearMonthIntervalType(start_field, 
end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting 
at LSB followed by 4-byte little-endian value.   |

Review Comment:
   What would the parquet types be for these intervals?
   
   Also, we need to describe what the start/end fields are and how exactly they 
are encoded in the byte.



##########
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##########
@@ -316,6 +334,18 @@ static void toJsonImpl(byte[] value, byte[] metadata, int 
pos, StringBuilder sb,
       case BINARY:
         appendQuoted(sb, 
Base64.getEncoder().encodeToString(VariantUtil.getBinary(value, pos)));
         break;
+      case YEAR_MONTH_INTERVAL:
+        IntervalFields ymFields = 
VariantUtil.getYearMonthIntervalFields(value, pos);
+        int ymValue = (int) VariantUtil.getLong(value, pos);

Review Comment:
   I'm not sure how this works. I thought the first byte at `pos` would include 
the start and end info. Then, I thought the remaining would be a 4 byte int? My 
confusion is:
   - Shouldn't the `pos` read be `pos + 1`?
   - Doesn't `VariantUtil.getLong(value, pos);` read variable length 
variant-encoded int? I thought we just need to read a 4-byte int, since the 4 
bytes won't have any variant type info?
   
   (similar question for `DAY_TIME_INTERVAL` below)



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##########
@@ -377,11 +402,51 @@ public static long getLong(byte[] value, int pos) {
       case TIMESTAMP:
       case TIMESTAMP_NTZ:
         return readLong(value, pos + 1, 8);
+      case YEAR_MONTH_INTERVAL:
+        return readLong(value, pos + 2, 4);
+      case DAY_TIME_INTERVAL:
+        return readLong(value, pos + 2, 8);
       default:
         throw new IllegalStateException(exceptionMessage);
     }
   }
 
+  public static class IntervalFields {

Review Comment:
   Can you comment on this class, and what it represents?



##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:
##########
@@ -214,6 +214,22 @@ public void appendTimestampNtz(long microsSinceEpoch) {
     writePos += 8;
   }
 
+  public void appendYearMonthInterval(long value, byte startField, byte 
endField) {
+    checkCapacity(1 + 5);
+    writeBuffer[writePos++] = primitiveHeader(YEAR_MONTH_INTERVAL);
+    writeBuffer[writePos++] = (byte) (startField | (endField << 1));
+    writeLong(writeBuffer, writePos, value, 4);
+    writePos += 4;
+  }
+
+  public void appendDayTimeInterval(long value, byte startField, byte 
endField) {
+    checkCapacity(1 + 9);
+    writeBuffer[writePos++] = primitiveHeader(DAY_TIME_INTERVAL);
+    writeBuffer[writePos++] = (byte) (startField | (endField << 2));

Review Comment:
   Should we mask `startField` and `endField` to make sure we are only using 
the desired bits?
   
   Or, we could do checks on the inputs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-45891][SQL][PySpark][VARIANT] Add support for interval types in the Variant Spec [spark]

Reply via email to