linliu-code opened a new pull request, #18794: URL: https://github.com/apache/hudi/pull/18794
### Change Logs

Fixes apache/hudi#18752: the Spark write path silently ignored both `spark.sql.parquet.outputTimestampType` (the standard Spark setting) and `hoodie.parquet.outputtimestamptype` (the documented Hudi override), always emitting `TIMESTAMP(MICROS)` for `TimestampType` columns. Spark's own writer under the same SparkSession honors both. The bug spans 0.15.0 → 1.1.1 → master HEAD.

The result is silently broken interop with downstream readers that expect `TIMESTAMP(MILLIS)` (smaller files) or `INT96` (legacy Hive/Impala): no error, no warning, the data just lands in the wrong logical type.

### Root causes (two layers)

**1. `HoodieRowParquetWriteSupport`** (the Row-based bulk_insert writer)

- The constructor unconditionally set the hadoopConf entry for the Hudi key to `config.getStringOrDefault(...)`, which always returned its default (`TIMESTAMP_MICROS`) and silently overrode any value the user had configured on the SparkSession.
- The `TimestampType` writer derived the encoding from the avro writer schema's precision (always MICROS for Spark `TimestampType`) and lacked an INT96 path entirely.
- The custom `MessageType` converter (called by parquet's `init()`) hard-coded `TIMESTAMP(MICROS)` for Spark `TimestampType` regardless of the chosen output type.

**2. `HoodieSparkSchemaConverters`** (the Spark→Avro conversion used by the upsert path)

- `TimestampType` → `HoodieSchema.createTimestampMicros()` was hard-coded, so the avro→parquet pipeline (`HoodieAvroWriteSupport`) could only emit MICROS, regardless of the user's setting.

### Fix

Introduces `HoodieRowParquetWriteSupport.resolveOutputTimestampType` with a documented priority:

1. `hoodie.parquet.outputtimestamptype` when explicitly set (compared against the default value to distinguish it from default-population).
2. `spark.sql.parquet.outputTimestampType` from the SparkSession's `SQLConf` when user-set (`SQLConf.contains` distinguishes user-set from Spark's own default).
3. Manually-propagated `spark.sql.parquet.outputTimestampType` in the Hadoop conf.
4. The Hudi default (`TIMESTAMP_MICROS`).

In `HoodieRowParquetWriteSupport`:

- `makeWriter(TimestampType)` dispatches on the resolved output type: it emits INT96 binary (Julian-day/nanos-of-day per the [parquet-format spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp)) for `INT96`, INT64+MILLIS for `TIMESTAMP_MILLIS`, and INT64+MICROS otherwise.
- `convertField(TimestampType)` dispatches the same way to produce a matching parquet schema (so writer and schema agree).
- Adds a `microsToInt96Binary` helper implementing the standard encoding.

In `HoodieSparkSchemaConverters`:

- `TimestampType` consults `SQLConf` and produces `HoodieSchema.createTimestampMillis()` when the user requested `TIMESTAMP_MILLIS`, else `createTimestampMicros()` as before.

### Known limit (documented in test class)

INT96 is bulk_insert-only. The upsert path goes through Avro, and Avro doesn't model INT96, so INT96 requests fall through to MICROS at the avro layer. The fix delivers the full matrix for the bulk_insert path and MILLIS/MICROS for the upsert path, which covers the realistic use cases (downstream readers expecting MILLIS for smaller files, or INT96 for legacy Hive/Impala interop where users typically already use bulk_insert).

### Impact

No public API change. Internal write path only.

### Risk level

Medium. The Hudi-config-vs-Spark-config priority change is intentional: users who previously relied on the silent default override will see Hudi now honor their explicit Spark setting. Users who explicitly set `hoodie.parquet.outputtimestamptype` continue to win (priority 1).

The avro-path change (`HoodieSparkSchemaConverters`) means downstream avro-aware code now sees `timestamp-millis` instead of `timestamp-micros` when the user requested MILLIS. This is the same behavior as Spark's native parquet writer.
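
For reviewers weighing the priority change, the resolution order from the Fix section can be sketched in plain Java. This is an illustrative, dependency-free stand-in and not the code in this PR: the class name `OutputTimestampTypeSketch`, the `resolve` signature, and the convention of passing `null` for "this source has no user-set value" are all assumptions made for the example.

```java
import java.util.Locale;

public class OutputTimestampTypeSketch {
    // Hudi default named in the PR description.
    public static final String HUDI_DEFAULT = "TIMESTAMP_MICROS";

    /**
     * Hypothetical stand-in for the priority chain described above.
     * Each argument is the raw value from one source, or null when that
     * source carries no user-set value (assumption for this sketch).
     */
    public static String resolve(String hudiValue, String sqlConfValue, String hadoopConfValue) {
        // 1. Hudi key wins when explicitly set; a value equal to the default
        //    is treated as default-population rather than an explicit choice.
        if (isSet(hudiValue) && !HUDI_DEFAULT.equalsIgnoreCase(hudiValue.trim())) {
            return normalize(hudiValue);
        }
        // 2. User-set Spark SQLConf value.
        if (isSet(sqlConfValue)) {
            return normalize(sqlConfValue);
        }
        // 3. Spark key manually propagated into the Hadoop conf.
        if (isSet(hadoopConfValue)) {
            return normalize(hadoopConfValue);
        }
        // 4. Fall back to the Hudi default.
        return HUDI_DEFAULT;
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    private static String normalize(String value) {
        return value.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Hudi override beats the Spark setting.
        System.out.println(resolve("INT96", "TIMESTAMP_MILLIS", null));
        // No Hudi override: the Spark SQLConf value is used.
        System.out.println(resolve(null, "TIMESTAMP_MILLIS", null));
        // Nothing set anywhere: the Hudi default applies.
        System.out.println(resolve(null, null, null));
    }
}
```

One consequence of comparing against the default, at least in this sketch: a Hudi value equal to `TIMESTAMP_MICROS` is indistinguishable from an unset key, so a user-set Spark value would take effect in that case.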

### Documentation Update

No documentation changes required. The behavior now matches what `hoodie.parquet.outputtimestamptype` was already documented to do.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable (`TestOutputTimestampType` functional tests cover bulk_insert × {MICROS,MILLIS,INT96}, upsert × {MICROS,MILLIS}, and the Hudi-vs-Spark priority chain)
- [ ] CI passed
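
As background for the INT96 path mentioned in the Fix section, here is a minimal sketch of the conventional INT96 timestamp layout: 8 bytes of nanoseconds-of-day followed by 4 bytes of Julian day number, both little-endian. It is not the `microsToInt96Binary` helper from this PR; the class and method names are hypothetical, and it returns a raw `byte[]` rather than a parquet `Binary` so that it runs without any dependencies.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96TimestampSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;
    // Julian day number of 1970-01-01 (the Unix epoch).
    static final int JULIAN_DAY_OF_EPOCH = 2_440_588;

    /** Encodes microseconds since the Unix epoch as a 12-byte INT96 value. */
    public static byte[] microsToInt96(long epochMicros) {
        // floorDiv/floorMod keep pre-epoch (negative) timestamps on the correct day.
        long daysSinceEpoch = Math.floorDiv(epochMicros, MICROS_PER_DAY);
        long microsOfDay = Math.floorMod(epochMicros, MICROS_PER_DAY);
        int julianDay = (int) (daysSinceEpoch + JULIAN_DAY_OF_EPOCH);
        long nanosOfDay = microsOfDay * 1_000L;
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)   // bytes 0..7: nanoseconds within the day
                .putInt(julianDay)     // bytes 8..11: Julian day number
                .array();
    }

    /** Reads the nanoseconds-of-day field back out of an encoded value. */
    public static long nanosOfDay(byte[] int96) {
        return ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    /** Reads the Julian day field back out of an encoded value. */
    public static int julianDay(byte[] int96) {
        return ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN).getInt(8);
    }

    public static void main(String[] args) {
        byte[] encoded = microsToInt96(0L);
        System.out.println(julianDay(encoded) + " / " + nanosOfDay(encoded));
    }
}
```

Using `Math.floorDiv` and `Math.floorMod` rather than `/` and `%` is the one design choice worth noting: truncating division would place timestamps before 1970 on the wrong Julian day with a negative time of day.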
