hudi-agent commented on code in PR #18794:
URL: https://github.com/apache/hudi/pull/18794#discussion_r3277580064
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieSparkSchemaConverters.scala:
##########
@@ -123,7 +123,24 @@ object HoodieSparkSchemaConverters extends
SparkAdapterSupport {
case ByteType | ShortType | IntegerType =>
HoodieSchema.create(HoodieSchemaType.INT)
case LongType => HoodieSchema.create(HoodieSchemaType.LONG)
case DateType => HoodieSchema.createDate()
- case TimestampType => HoodieSchema.createTimestampMicros()
+ case TimestampType =>
+ // Honor spark.sql.parquet.outputTimestampType so the user's setting
flows through
+ // to the avroβparquet pipeline (HoodieAvroWriteSupport path). Avro
doesn't model
+ // INT96, so an INT96 request is downgraded to MICROS at the schema
level β only
+ // the Row-based bulk_insert writer (HoodieRowParquetWriteSupport) can
emit INT96.
+ // See apache/hudi#18752.
+ val outputTsType = try {
+ val sqlConf = SQLConf.get
+ val key = SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key
+ if (sqlConf.contains(key)) sqlConf.getConfString(key) else null
Review Comment:
π€ This honors `spark.sql.parquet.outputTimestampType` but not
`hoodie.parquet.outputtimestamptype`. So a user who only sets the Hudi-specific
key to `TIMESTAMP_MILLIS` (no Spark conf) will get MILLIS in the bulk_insert
path via `resolveOutputTimestampType` but MICROS in the upsert/avro path here.
Is that asymmetry intentional? The PR description calls out the INT96 downgrade
for avro, but this Hudi-key vs Spark-conf split is a different and more
surprising case. Worth either threading the Hudi config through this converter,
or documenting the limitation alongside `resolveOutputTimestampType`.
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -143,7 +145,17 @@ public HoodieRowParquetWriteSupport(Configuration conf,
StructType structType, O
Configuration hadoopConf = new Configuration(conf);
String writeLegacyFormatEnabled =
config.getStringOrDefault(HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED,
"false");
hadoopConf.set("spark.sql.parquet.writeLegacyFormat",
writeLegacyFormatEnabled);
- hadoopConf.set("spark.sql.parquet.outputTimestampType",
config.getStringOrDefault(HoodieStorageConfig.PARQUET_OUTPUT_TIMESTAMP_TYPE));
+ // Resolve the effective parquet output timestamp type. Priority:
+ // 1. Hudi's hoodie.parquet.outputtimestamptype, when the user
explicitly set it
+ // (this is the documented override mechanism).
+ // 2. Spark's own spark.sql.parquet.outputTimestampType, when the user
set it on the
+ // SparkSession but didn't set the Hudi-specific key.
+ // 3. The Hudi default (TIMESTAMP_MICROS).
+ // Previously this line unconditionally read the Hudi key with its default
value, which
+ // silently overrode any spark.sql.parquet.outputTimestampType the user
had configured.
+ String chosenOutputTimestampType = resolveOutputTimestampType(conf,
config);
Review Comment:
π€ nit: could you inline this directly β `this.outputTimestampType =
resolveOutputTimestampType(conf, config);`? The local
`chosenOutputTimestampType` just relays the value one line down without adding
any clarity.
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -841,4 +875,84 @@ private Type convertField(HoodieSchema fieldSchema,
StructField structField, Typ
throw new UnsupportedOperationException("Unsupported type: " + dataType);
}
}
+
+ // INT96 = julianDay(4 bytes) + nanosOfDay(8 bytes), little-endian.
+ //
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
Review Comment:
π€ nit: `MICROS_PER_DAY` and `JULIAN_DAY_OF_EPOCH` look like private
implementation details of `microsToInt96Binary` β have you considered adding
`private` to avoid unintentionally exposing them at package scope?
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -442,19 +454,32 @@ private ValueWriter makeWriter(HoodieSchema schema,
DataType dataType) {
} else if (dataType == DataTypes.LongType || dataType instanceof
DayTimeIntervalType) {
return (row, ordinal) -> recordConsumer.addLong(row.getLong(ordinal));
} else if (dataType == DataTypes.TimestampType) {
+ // Choose the encoding that matches outputTimestampType β same key
Spark's own
+ // ParquetWriteSupport uses to set the parquet schema, so writer and
schema agree.
+ // The Hudi avro writer schema only carries MICROS/MILLIS precision (no
INT96), so
+ // the previous version of this method drove the writer from schema
precision alone
+ // and silently ignored the user's outputTimestampType β see
apache/hudi#18752.
+ if ("INT96".equals(outputTimestampType)) {
Review Comment:
π€ The precedence between `outputTimestampType` and `resolvedSchema` diverges
across methods: here in `makeWriter` `outputTimestampType` wins (INT96
short-circuits at line 462), but in `convertField` (line 748) `resolvedSchema`
wins. So if a user sets `hoodie.parquet.outputtimestamptype=INT96` together
with `spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS` (which now makes
`HoodieSparkSchemaConverters` produce a `timestamp-millis` avro schema, so
`resolvedSchema` is MILLIS), this method emits 12-byte INT96 binary while
`convertField` declares INT64 TIMESTAMP(MILLIS). Parquet's `RecordConsumer`
won't validate the mismatch, so we'd write a malformed file. Could the
precedence be hoisted into a single helper that returns the effective type, so
writer and schema can't drift?
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]