Re: [PR] Fix: Arrow vectorized reads for TIMESTAMP_MILLIS and shaded JAR initialization [iceberg]

via GitHub Thu, 06 Nov 2025 23:54:00 -0800


shubham19may commented on code in PR #14499:
URL: https://github.com/apache/iceberg/pull/14499#discussion_r2502016735



##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java:
##########
@@ -232,6 +233,81 @@ protected WriteSupport<InternalRow> 
getWriteSupport(Configuration configuration)
     }
   }
 
+  @Test
+  public void testTimestampMillisProducedBySparkIsReadCorrectly() throws 
IOException {
+    String outputFilePath =
+        String.format("%s/%s", temp.toAbsolutePath(), 
"parquet_timestamp_millis.parquet");
+    HadoopOutputFile outputFile =
+        HadoopOutputFile.fromPath(
+            new org.apache.hadoop.fs.Path(outputFilePath), new 
Configuration());
+
+    Schema schema = new Schema(required(1, "event_time", 
Types.TimestampType.withZone()));
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[] {
+              new StructField("event_time", DataTypes.TimestampType, false, 
Metadata.empty())
+            });
+
+    List<InternalRow> originalRows = 
Lists.newArrayList(RandomData.generateSpark(schema, 10, 0L));
+    List<InternalRow> rows = Lists.newArrayList();
+    for (InternalRow row : originalRows) {
+      long timestampMicros = row.getLong(0);
+      long timestampMillis = (timestampMicros / 1000) * 1000;
+      rows.add(
+          new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(
+              new Object[] {timestampMillis}));
+    }
+
+    try (ParquetWriter<InternalRow> writer =
+        new NativeSparkWriterBuilder(outputFile)
+            .set("org.apache.spark.sql.parquet.row.attributes", 
sparkSchema.json())
+            .set("spark.sql.parquet.writeLegacyFormat", "false")
+            .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
+            .set("spark.sql.parquet.fieldId.write.enabled", "true")
+            .build()) {
+      for (InternalRow row : rows) {
+        writer.write(row);
+      }
+    }

Review Comment:
   Well, the issue is, iceberg’s native parquet writers explicitly reject 
TIMESTAMP_MILLIS ([check 
here](https://github.com/apache/iceberg/blob/843bb46e27773e0e18df7ce3bd377a0b56f35074/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetWriter.java#L238)),
 as by design iceberg standardizes on microsecond precision.
    
   TIMESTAMP_MILLS support exists only for reading externally-produced files 
(like from Spark)
    
   To write a test without Spark will add a lot of extra code, and that too 
using Parquet API. IMO, it’s better to have the current end to end test with 
Spark.



##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java:
##########
@@ -232,6 +233,81 @@ protected WriteSupport<InternalRow> 
getWriteSupport(Configuration configuration)
     }
   }
 
+  @Test
+  public void testTimestampMillisProducedBySparkIsReadCorrectly() throws 
IOException {
+    String outputFilePath =
+        String.format("%s/%s", temp.toAbsolutePath(), 
"parquet_timestamp_millis.parquet");
+    HadoopOutputFile outputFile =
+        HadoopOutputFile.fromPath(
+            new org.apache.hadoop.fs.Path(outputFilePath), new 
Configuration());

Review Comment:
   Well, to use InMemoryOutputFile, we need to use ParquetIO which is a 
package-private. Ultimately would require an additional helper method which 
will bring more unnecessary code.
    
   Moreover, its sibling tests like 
testInt96TimestampProducedBySparkIsReadCorrectly() , do use disk-based 
approach, and it being an integration test, the current state looks more 
optimal IMO.
    
   Do you still want me to change it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix: Arrow vectorized reads for TIMESTAMP_MILLIS and shaded JAR initialization [iceberg]

Reply via email to