yihua opened a new pull request, #19029:
URL: https://github.com/apache/hudi/pull/19029

   ### Change Logs
   
   **Background.** Earlier Hudi versions mishandled long-backed timestamp 
logical types in `AvroInternalSchemaConverter`:
   1. `timestamp-millis` and `timestamp-micros` both collapsed into a single 
internal `TimestampType` and were always re-emitted as `timestampMicros()` on 
serialize. A source schema declaring `timestamp-millis` got persisted in the 
table with the wrong `timestamp-micros` logical type, while the underlying 
`long` values written to parquet remained epoch-millis. Pure logical-type drift.
   2. `local-timestamp-millis` and `local-timestamp-micros` had no branch at 
all. They fell through to the bare `LongType`, and the logical type was dropped 
from the table schema entirely. Logical-type loss.
   
   In both cases the parquet values are correct; only the logical type on the 
field is wrong. Current converters recognize all four logical types as 
distinct, so the writer schema now declares the correct logical type. On every 
subsequent write the reconcile path compares the writer schema against the 
persisted table schema, finds the logical-type mismatch, and rejects the write.
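
The two legacy failure modes can be sketched as a toy round trip. This is only an illustrative model of the behavior described above: the class, enums, and methods are hypothetical stand-ins, not the actual `AvroInternalSchemaConverter` code.

```java
import java.util.Optional;

// Hypothetical model of the legacy Avro -> internal -> Avro round trip.
// None of these names are Hudi APIs; they only mirror the described behavior.
final class LegacyTimestampRoundTripSketch {
    enum AvroLogical {
        TIMESTAMP_MILLIS, TIMESTAMP_MICROS, LOCAL_TIMESTAMP_MILLIS, LOCAL_TIMESTAMP_MICROS
    }

    enum InternalType { TIMESTAMP, LONG }

    // Legacy parse: both UTC timestamp precisions collapse into one internal
    // type, and the local-timestamp variants have no branch at all.
    static InternalType toInternal(AvroLogical logical) {
        switch (logical) {
            case TIMESTAMP_MILLIS:
            case TIMESTAMP_MICROS:
                return InternalType.TIMESTAMP;
            default:
                return InternalType.LONG; // logical type is dropped here
        }
    }

    // Legacy serialize: the internal timestamp is always emitted as micros,
    // and a bare long carries no logical type.
    static Optional<AvroLogical> toAvro(InternalType type) {
        return type == InternalType.TIMESTAMP
                ? Optional.of(AvroLogical.TIMESTAMP_MICROS)
                : Optional.empty();
    }

    static Optional<AvroLogical> roundTrip(AvroLogical logical) {
        return toAvro(toInternal(logical));
    }

    public static void main(String[] args) {
        for (AvroLogical logical : AvroLogical.values()) {
            System.out.println(logical + " -> " + roundTrip(logical));
        }
    }
}
```

In this model `TIMESTAMP_MILLIS` comes back as `TIMESTAMP_MICROS` (drift), and both local variants come back with no logical type (loss).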
   
   **Why writes get blocked.** When 
`hoodie.write.set.null.for.missing.columns=true`, 
`HoodieSchemaUtils.deduceWriterSchema` calls 
`AvroSchemaEvolutionUtils.reconcileSchema`, which goes through 
`TableChanges.ColumnUpdateChange.updateColumnType` and 
`SchemaChangeUtils.isTypeUpdateAllow`. The switch in `isTypeUpdateAllow` had no case for `TIMESTAMP` 
or `TIMESTAMP_MILLIS`, so any precision/logical-type change fell into `default: 
return false` and the write failed with `SchemaCompatibilityException`. Streams that need 
null-fill on missing columns therefore cannot write to any table whose long-backed 
timestamp columns were persisted with the mishandled logical types.
   
   The non-reconcile path (`hoodie.write.set.null.for.missing.columns=false`) is 
unaffected: it skips `reconcileSchema` and lets 
`AvroSchemaCompatibility.checkReaderWriterCompatibility` validate, which is 
logical-type-blind (both timestamps are `long` underneath) and accepts the 
correction. The reconcile path was therefore stricter than the non-reconcile 
path for the same scenario, and only the 
`hoodie.write.set.null.for.missing.columns=true` path was broken.
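
The asymmetry between the two paths can be illustrated with a small model. The sketch below is hypothetical: `plainCheckAccepts` stands in for a logical-type-blind compatibility check and `strictReconcileAccepts` for the pre-fix reconcile behavior. Neither is the real Avro or Hudi method.

```java
// Hypothetical model of the two validation paths; illustrative names only.
final class ReconcileVsPlainCheckSketch {
    enum Logical { TIMESTAMP_MILLIS, TIMESTAMP_MICROS }

    /** A long-backed timestamp field: a primitive type plus a logical-type annotation. */
    static final class Field {
        final String primitive;
        final Logical logical;

        Field(String primitive, Logical logical) {
            this.primitive = primitive;
            this.logical = logical;
        }
    }

    // Non-reconcile path: only the underlying primitive is compared, so a
    // logical-type correction on a long-backed field passes.
    static boolean plainCheckAccepts(Field table, Field writer) {
        return table.primitive.equals(writer.primitive);
    }

    // Pre-fix reconcile path: any logical-type difference is treated as a
    // disallowed type update (modeling the `default: return false` branch).
    static boolean strictReconcileAccepts(Field table, Field writer) {
        return table.primitive.equals(writer.primitive) && table.logical == writer.logical;
    }

    public static void main(String[] args) {
        Field persisted = new Field("long", Logical.TIMESTAMP_MICROS); // drifted table schema
        Field incoming = new Field("long", Logical.TIMESTAMP_MILLIS);  // corrected writer schema
        System.out.println("plain check accepts: " + plainCheckAccepts(persisted, incoming));
        System.out.println("strict reconcile accepts: " + strictReconcileAccepts(persisted, incoming));
    }
}
```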
   
   **Fix.** New write config 
`hoodie.write.schema.allow.timestamp.precision.evolution` (default `false`) 
that, when `true`, lets `SchemaChangeUtils.isTypeUpdateAllow` permit:
   - `timestamp-millis ↔ timestamp-micros` (logical-type drift case, both 
directions)
   - `local-timestamp-millis ↔ local-timestamp-micros` (precision swap among 
the recognized variants)
   - `long → local-timestamp-millis` / `long → local-timestamp-micros` 
(logical-type loss case, attach the missing logical type)
   
   Default `false` preserves the existing strict rejection. Opt-in is per-write 
by setting the config to `true`. The flag is threaded into 
`AvroSchemaEvolutionUtils.reconcileSchema` and 
`TableChanges.ColumnUpdateChange`, and read from the write properties by 
`HoodieSchemaUtils.scala`, `BaseHoodieWriteClient`, `HoodieMergeHelper`, and 
`FileGroupReaderBasedMergeHandle`.
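
A minimal sketch of the gated rule, assuming the flag arrives as a plain string property. Only the config key comes from this PR; the `Type` enum, `gateOpen`, and `isTimestampUpdateAllowed` are hypothetical names that model just the three transitions listed above, not the full `SchemaChangeUtils.isTypeUpdateAllow` logic.

```java
import java.util.Properties;

// Hypothetical model of the opt-in gate; not the actual Hudi implementation.
final class TimestampPrecisionGateSketch {
    // Config key introduced by this PR.
    static final String KEY = "hoodie.write.schema.allow.timestamp.precision.evolution";

    enum Type {
        LONG, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, LOCAL_TIMESTAMP_MILLIS, LOCAL_TIMESTAMP_MICROS
    }

    // The gate is closed unless the property is explicitly set to "true".
    static boolean gateOpen(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    // Models only the timestamp-related transitions described in this PR.
    static boolean isTimestampUpdateAllowed(Type from, Type to, boolean allowPrecisionEvolution) {
        if (!allowPrecisionEvolution) {
            return false; // default: strict rejection is preserved
        }
        switch (from) {
            case TIMESTAMP_MILLIS:
                return to == Type.TIMESTAMP_MICROS;
            case TIMESTAMP_MICROS:
                return to == Type.TIMESTAMP_MILLIS;
            case LOCAL_TIMESTAMP_MILLIS:
                return to == Type.LOCAL_TIMESTAMP_MICROS;
            case LOCAL_TIMESTAMP_MICROS:
                return to == Type.LOCAL_TIMESTAMP_MILLIS;
            case LONG:
                // Attach a missing local-timestamp logical type.
                return to == Type.LOCAL_TIMESTAMP_MILLIS || to == Type.LOCAL_TIMESTAMP_MICROS;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY, "true"); // explicit opt-in for this write
        boolean allow = gateOpen(props);
        System.out.println(isTimestampUpdateAllowed(Type.TIMESTAMP_MICROS, Type.TIMESTAMP_MILLIS, allow));
        System.out.println(isTimestampUpdateAllowed(Type.LONG, Type.LOCAL_TIMESTAMP_MICROS, allow));
    }
}
```

With an empty `Properties` object the gate stays closed and every transition in this model is rejected, matching the default-`false` behavior.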
   
   **Why not unconditional.** Default writes should not silently change column 
logical types; the gate keeps the opt-in explicit and per-write. Older readers 
that don't recognize the new logical-type values degrade gracefully: 
`AvroInternalSchemaConverter` falls through to the bare primitive for 
unrecognized logical types, so the corrected logical type is simply ignored on 
the reader side.
   
   Complements the read-side repair from #14161, which handles parquet files 
carrying the wrong logical type transparently until the table schema is 
corrected. This PR closes the write-side gap so the table schema itself can be 
brought into agreement with the writer schema once and for all.
   
   ### Impact
   
   - New advanced write config, opt-in. Default preserves the prior strict 
behavior; no existing caller sees a change.
   - One new overload on `AvroSchemaEvolutionUtils.reconcileSchema`, one new 
factory on `TableChanges.ColumnUpdateChange.get`. Pre-existing overloads kept 
as delegates.
   - Plumbing through `HoodieSchemaUtils`, `BaseHoodieWriteClient`, 
`HoodieMergeHelper`, `FileGroupReaderBasedMergeHandle`.
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   Opt-in gate with default-false means existing writers are unchanged. 
Positive test variants exercise the gated repair path on the v6/v8/CURRENT 
logical-repair fixtures from #14161. A negative variant asserts 
`SchemaCompatibilityException` when the reconcile path is on with the gate 
closed, locking in the default behavior.
   
   ### Documentation Update
   
   New config documented inline on 
`HoodieCommonConfig.ALLOW_TIMESTAMP_PRECISION_EVOLUTION` with 
`sinceVersion("1.3.0")`.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed

