[PR] [HUDI-18606] fix(spark): handle Avro 1.12 logical type values in Spark 4.1 read path [hudi]

via GitHub Mon, 18 May 2026 12:57:50 -0700


yihua opened a new pull request, #18773:
URL: https://github.com/apache/hudi/pull/18773


   ### Describe the issue this Pull Request addresses
   
   Closes #18606
   
   Spark 4.1 pulls in Avro 1.12, which installs default `Conversion`s on 
`GenericData.get()` for date/time logical types. Generic records returned by 
`GenericDatumReader` now materialize `java.time.LocalDate` / 
`java.time.Instant` / `java.time.LocalDateTime` for fields that Avro 1.11.x 
(Spark 3.5 / 4.0) exposed as raw `Integer` / `Long`. This breaks Hudi's 
in-memory comparison and casting on the read path. For example, `MERGE INTO` 
with a `timestamp` or `date` precombine field fails with a 
`ClassCastException` in `DefaultHoodieRecordPayload.compareOrderingVal` 
(`Instant` vs `Long`). Even after the Spark deserializer is fixed, the same 
mismatch surfaces in `HoodieAvroUtils.getNestedFieldVal` during 
ordering-value extraction.
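
The failure mode described above can be reproduced without Avro or Hudi. The following is a minimal sketch, not Hudi code: `OrderingMismatchSketch`, `compareRaw`, and `failsWithClassCast` are hypothetical names, and the raw `Comparable` call is only assumed to resemble how ordering values end up being compared.

```java
import java.time.Instant;

// Hypothetical illustration; not part of Hudi.
final class OrderingMismatchSketch {

    // Compares two ordering values through the raw Comparable interface,
    // so the compiler cannot catch a type mismatch between the two sides.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static int compareRaw(Comparable left, Comparable right) {
        return left.compareTo(right);
    }

    // Returns true when the comparison throws ClassCastException.
    static boolean failsWithClassCast(Comparable<?> left, Comparable<?> right) {
        try {
            compareRaw(left, right);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Same point in time in the two in-memory forms.
        Long primitiveForm = 1_700_000_000_000L;                    // Avro 1.11.x style
        Instant javaTimeForm = Instant.ofEpochMilli(primitiveForm); // Avro 1.12 style

        System.out.println("mixed forms fail: "
                + failsWithClassCast(javaTimeForm, primitiveForm));
        System.out.println("same form fails: "
                + failsWithClassCast(primitiveForm, primitiveForm));
    }
}
```

Both values represent the same instant, but `Instant.compareTo` cannot accept a `Long` (and vice versa), which is why the read path has to normalize before comparing.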
   
   ### Summary and Changelog
   
   Read-side normalization only: the on-disk byte format is unaffected, and 
writer / reader cross-compatibility between Spark 3.5 / 4.0 and Spark 4.1 is 
preserved.
   
   - `hudi-common` `HoodieAvroUtils.convertValueForAvroLogicalTypes`: accepts 
both the Avro 1.11.x primitive form (`Integer` / `Long`) and the Avro 1.12 
`java.time` form (`LocalDate` / `Instant` / `LocalDateTime`), normalizing to 
the same canonical value (epoch-day / epoch-millis / epoch-micros). Added 
javadoc explaining the Avro 1.12 situation and why storage bytes are not 
affected. Added private `extract*` helpers.
   - `hudi-common` `HoodieAvroWrapperUtils.unwrapAvroValueWrapper(Object, 
String)`: fixed three unguarded `(Integer)` / `(Long)` casts on 
`GenericRecord.get(0)` for `DateWrapper` / `LocalDateWrapper` / 
`TimestampMicrosWrapper` via local helpers that accept both encodings.
   - `hudi-spark4.1.x` `AvroDeserializer.scala`: restored the `Instant` / 
`LocalDate` / `LocalDateTime` fallback branches in `(INT, IntegerType)`, `(INT, 
DateType)`, `(LONG, LongType)`, `(LONG, TimestampType)`, and `(LONG, 
TimestampNTZType)`. Added a block comment explaining the Avro 1.12 vs 1.11.x 
behavior and that the change is read-side only.
   - Re-enabled `TestMergeIntoTable.Test Different Type of PreCombineField` on 
Spark 4.1 (the previous `assume(!gteqSpark4_1, ...)` workaround is no longer 
needed).
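
The dual-encoding normalization described in the changelog can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class and method names are hypothetical, and treating `LocalDateTime` as UTC wall-clock time is assumed here to match Avro's local-timestamp semantics.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch; names do not correspond to Hudi's private helpers.
final class LogicalTypeNormalizerSketch {

    // date: Integer (epoch-day) or LocalDate -> epoch-day
    static int toEpochDay(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof LocalDate) {
            return Math.toIntExact(((LocalDate) value).toEpochDay());
        }
        throw new IllegalArgumentException("Unsupported date value: " + value);
    }

    // timestamp-millis / local-timestamp-millis -> epoch-millis
    static long toEpochMillis(Object value) {
        if (value instanceof Long) {
            return (Long) value;
        }
        if (value instanceof Instant) {
            return ((Instant) value).toEpochMilli();
        }
        if (value instanceof LocalDateTime) {
            // Assumption: local timestamps are interpreted as UTC wall-clock time.
            return ((LocalDateTime) value).toInstant(ZoneOffset.UTC).toEpochMilli();
        }
        throw new IllegalArgumentException("Unsupported timestamp value: " + value);
    }

    // timestamp-micros / local-timestamp-micros -> epoch-micros
    static long toEpochMicros(Object value) {
        if (value instanceof Long) {
            return (Long) value;
        }
        if (value instanceof Instant) {
            return microsOf((Instant) value);
        }
        if (value instanceof LocalDateTime) {
            return microsOf(((LocalDateTime) value).toInstant(ZoneOffset.UTC));
        }
        throw new IllegalArgumentException("Unsupported timestamp value: " + value);
    }

    // getNano() is never negative, so this floors toward negative infinity.
    private static long microsOf(Instant instant) {
        return Math.addExact(
                Math.multiplyExact(instant.getEpochSecond(), 1_000_000L),
                instant.getNano() / 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(toEpochDay(LocalDate.ofEpochDay(19_000)));
        System.out.println(toEpochMicros(Instant.ofEpochSecond(1, 2_000)));
    }
}
```

Because every branch returns the same primitive for the same logical value, callers downstream of the normalization never need to know which Avro version produced the record.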
   
   Tests added:
   - `TestHoodieAvroUtils.testConvertValueForAvroLogicalTypesCrossAvroVersion`: 
feeds both encodings for date / timestamp-millis / timestamp-micros / 
local-timestamp-millis / local-timestamp-micros and asserts identical canonical 
output.
   - `TestHoodieAvroUtils.testGetNestedFieldValOrderingInvariantAcrossAvroVersions`: 
builds two records (primitive vs `java.time`) and asserts `compareTo` returns 
0, the precise contract `DefaultHoodieRecordPayload.compareOrderingVal` relies 
on.
   - `TestSpark4_1AvroLogicalTypeBytes` (new, in `hudi-spark4.1.x`): asserts 
`HoodieSpark4_1AvroSerializer` emits raw `Long` / `Integer` (never java.time) 
into the `GenericRecord`, and that `GenericDatumWriter` output matches an 
independent zig-zag varlong encoding per the Avro spec. This pins the 
storage-byte invariant directly without needing to build both Spark profiles.
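
For readers unfamiliar with the encoding the last test checks against: the Avro spec writes `int` and `long` values using variable-length zig-zag coding. A minimal, independent encoder might look like the sketch below. It is not the test's actual helper; `ZigZagVarLongSketch` and `encodeLong` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of Avro's zig-zag variable-length long encoding.
final class ZigZagVarLongSketch {

    static byte[] encodeLong(long n) {
        // Zig-zag: map signed to unsigned so small magnitudes stay small
        // (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
        long z = (n << 1) ^ (n >> 63);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits per byte, low-order group first; the high bit of each
        // byte signals that more bytes follow.
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Print the encoded bytes of an epoch-micros value in hex.
        for (byte b : encodeLong(1_700_000_000_000_000L)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

Since the serializer hands `GenericDatumWriter` the same primitive `Long` / `Integer` under either Avro version, the bytes produced by this encoding are the same regardless of which in-memory form the reader later materializes.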
   
   ### Impact
   
   No public API change. No on-disk format change. Bug fix.
   
   ### Risk Level
   
   low
   
   The change is scoped to in-memory deserialization and ordering-value 
extraction in `hudi-common` and `hudi-spark4.1.x`. The write path is untouched: 
`AvroSerializer` (Spark 4.1) emits only primitive `Long` / `Int` into 
`GenericRecord`s, and `GenericDatumWriter` encodes those bytes per the Avro 
spec, identical to what Avro 1.11.x writes. The new write-side test enforces 
this contract.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

