Re: [PR] fix(spark): honor spark.sql.parquet.outputTimestampType on the write path (#18752) [hudi]

via GitHub Wed, 20 May 2026 15:36:02 -0700


hudi-agent commented on code in PR #18794:
URL: https://github.com/apache/hudi/pull/18794#discussion_r3277580064



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieSparkSchemaConverters.scala:
##########
@@ -123,7 +123,24 @@ object HoodieSparkSchemaConverters extends 
SparkAdapterSupport {
       case ByteType | ShortType | IntegerType => 
HoodieSchema.create(HoodieSchemaType.INT)
       case LongType => HoodieSchema.create(HoodieSchemaType.LONG)
       case DateType => HoodieSchema.createDate()
-      case TimestampType => HoodieSchema.createTimestampMicros()
+      case TimestampType =>
+        // Honor spark.sql.parquet.outputTimestampType so the user's setting 
flows through
+        // to the avro→parquet pipeline (HoodieAvroWriteSupport path). Avro 
doesn't model
+        // INT96, so an INT96 request is downgraded to MICROS at the schema 
level — only
+        // the Row-based bulk_insert writer (HoodieRowParquetWriteSupport) can 
emit INT96.
+        // See apache/hudi#18752.
+        val outputTsType = try {
+          val sqlConf = SQLConf.get
+          val key = SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key
+          if (sqlConf.contains(key)) sqlConf.getConfString(key) else null

Review Comment:
   🤖 This honors `spark.sql.parquet.outputTimestampType` but not 
`hoodie.parquet.outputtimestamptype`. So a user who only sets the Hudi-specific 
key to `TIMESTAMP_MILLIS` (no Spark conf) will get MILLIS in the bulk_insert 
path via `resolveOutputTimestampType` but MICROS in the upsert/avro path here. 
Is that asymmetry intentional? The PR description calls out the INT96 downgrade 
for avro, but this Hudi-key vs Spark-conf split is a different and more 
surprising case. Worth either threading the Hudi config through this converter, 
or documenting the limitation alongside `resolveOutputTimestampType`.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -143,7 +145,17 @@ public HoodieRowParquetWriteSupport(Configuration conf, 
StructType structType, O
     Configuration hadoopConf = new Configuration(conf);
     String writeLegacyFormatEnabled = 
config.getStringOrDefault(HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED,
 "false");
     hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeLegacyFormatEnabled);
-    hadoopConf.set("spark.sql.parquet.outputTimestampType", 
config.getStringOrDefault(HoodieStorageConfig.PARQUET_OUTPUT_TIMESTAMP_TYPE));
+    // Resolve the effective parquet output timestamp type. Priority:
+    //   1. Hudi's hoodie.parquet.outputtimestamptype, when the user 
explicitly set it
+    //      (this is the documented override mechanism).
+    //   2. Spark's own spark.sql.parquet.outputTimestampType, when the user 
set it on the
+    //      SparkSession but didn't set the Hudi-specific key.
+    //   3. The Hudi default (TIMESTAMP_MICROS).
+    // Previously this line unconditionally read the Hudi key with its default 
value, which
+    // silently overrode any spark.sql.parquet.outputTimestampType the user 
had configured.
+    String chosenOutputTimestampType = resolveOutputTimestampType(conf, 
config);

Review Comment:
   🤖 nit: could you inline this directly — `this.outputTimestampType = 
resolveOutputTimestampType(conf, config);`? The local 
`chosenOutputTimestampType` just relays the value one line down without adding 
any clarity.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -841,4 +875,84 @@ private Type convertField(HoodieSchema fieldSchema, 
StructField structField, Typ
       throw new UnsupportedOperationException("Unsupported type: " + dataType);
     }
   }
+
+  // INT96 = julianDay(4 bytes) + nanosOfDay(8 bytes), little-endian.
+  // 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp

Review Comment:
   🤖 nit: `MICROS_PER_DAY` and `JULIAN_DAY_OF_EPOCH` look like private 
implementation details of `microsToInt96Binary` — have you considered adding 
`private` to avoid unintentionally exposing them at package scope?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -442,19 +454,32 @@ private ValueWriter makeWriter(HoodieSchema schema, 
DataType dataType) {
     } else if (dataType == DataTypes.LongType || dataType instanceof 
DayTimeIntervalType) {
       return (row, ordinal) -> recordConsumer.addLong(row.getLong(ordinal));
     } else if (dataType == DataTypes.TimestampType) {
+      // Choose the encoding that matches outputTimestampType — same key 
Spark's own
+      // ParquetWriteSupport uses to set the parquet schema, so writer and 
schema agree.
+      // The Hudi avro writer schema only carries MICROS/MILLIS precision (no 
INT96), so
+      // the previous version of this method drove the writer from schema 
precision alone
+      // and silently ignored the user's outputTimestampType — see 
apache/hudi#18752.
+      if ("INT96".equals(outputTimestampType)) {

Review Comment:
   🤖 The precedence between `outputTimestampType` and `resolvedSchema` diverges 
across methods: here in `makeWriter` `outputTimestampType` wins (INT96 
short-circuits at line 462), but in `convertField` (line 748) `resolvedSchema` 
wins. So if a user sets `hoodie.parquet.outputtimestamptype=INT96` together 
with `spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS` (which now makes 
`HoodieSparkSchemaConverters` produce a `timestamp-millis` avro schema, so 
`resolvedSchema` is MILLIS), this method emits 12-byte INT96 binary while 
`convertField` declares INT64 TIMESTAMP(MILLIS). Parquet's `RecordConsumer` 
won't validate the mismatch, so we'd write a malformed file. Could the 
precedence be hoisted into a single helper that returns the effective type, so 
writer and schema can't drift?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(spark): honor spark.sql.parquet.outputTimestampType on the write path (#18752) [hudi]

Reply via email to