Re: [PR] Fix: Arrow vectorized reads for TIMESTAMP_MILLIS and shaded JAR initialization [iceberg]

via GitHub Thu, 06 Nov 2025 23:52:07 -0800


shubham19may commented on code in PR #14499:
URL: https://github.com/apache/iceberg/pull/14499#discussion_r2502014838



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -541,31 +541,41 @@ public Optional<LogicalTypeVisitorResult> visit(
     @Override
     public Optional<LogicalTypeVisitorResult> visit(
         LogicalTypeAnnotation.TimestampLogicalTypeAnnotation 
timestampLogicalType) {
-      FieldVector vector = arrowField.createVector(rootAlloc);
       switch (timestampLogicalType.getUnit()) {
         case MILLIS:
-          ((BigIntVector) vector).allocateNew(batchSize);
+          Field bigIntField =
+              new Field(
+                  icebergField.name(),
+                  new FieldType(
+                      icebergField.isOptional(), new ArrowType.Int(Long.SIZE, 
true), null, null),
+                  null);
+          FieldVector millisVector = bigIntField.createVector(rootAlloc);
+          ((BigIntVector) millisVector).allocateNew(batchSize);

Review Comment:
   Okey, so let me explain in a bit more detail.
    
   Yes, you understood it correctly, `TimestampMillisReader.nextVal` reads the 
value and converts them to long microseconds.
    
   And, It’s not “luck” - it’s an intentional design in both arrow and 
iceberg’s architecture.
    
   1. All 64-bit fixed-width vectors in arrow share the same underlying memory 
layout by design ([check 
here](https://github.com/apache/iceberg/blob/bbf5b8a9060932931275160d262af314c5c1f336/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java#L474))
    
   `BigIntVector` : generic 64-bit signed integer
   `TimeStampMicroVector` : 64-bit timestamps (microseconds)
   `TimeStampMilliTZVector` : 64-bit timestamps (milliseconds)
    
   2. According to Iceberg’s convention, all timestamps are internally 
represented as microseconds, i.e., longs. ([check 
here](https://github.com/apache/iceberg/blob/bbf5b8a9060932931275160d262af314c5c1f336/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java#L474))
   3. According to parquet’s specification, `TIMESTAMP_MILLIS` is physically 
stored as `INT64`
    
   The `BigIntVector` , would serve as the perfect container for the “converted 
INT64 values” - values that are read from Parquet, converted to microseconds, 
and stored as raw longs. This is more accurate than using 
`TimeStampMilliTZVector` (which would suggest the data is still in 
milliseconds).
    
   The semantic difference is in metadata, not memory layout.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix: Arrow vectorized reads for TIMESTAMP_MILLIS and shaded JAR initialization [iceberg]

Reply via email to