MaxGekk opened a new pull request, #56651:
URL: https://github.com/apache/spark/pull/56651
### What changes were proposed in this pull request?
This is a backport of #56633 (master commit
24e0663709335f5a1d8d5d3e0d45658bec3259f2) to `branch-4.2`.
It makes Spark's Avro encoding of the `TIME` data type unit-correct.
`TimeType` is represented internally as nanoseconds-since-midnight, but the
Avro path annotated the column with the `time-micros` logical type while
writing the raw nanosecond value, so the declared unit (microseconds) did not
match the stored unit (nanoseconds).
The fix converts the value to match the logical type:
- Write path (`AvroSerializer`): `nanos -> micros`
(`DateTimeUtils.nanosToMicros`) before writing under `time-micros`.
- Read path (`AvroDeserializer`): `micros -> nanos`
(`DateTimeUtils.microsToNanos`) when reading a `time-micros` value into
`TimeType`.
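
The two conversions above can be sketched as follows. This is a minimal illustration in Python, not Spark's code: `nanos_to_micros` and `micros_to_nanos` are hypothetical stand-ins for the `DateTimeUtils` helpers named in the list, and floor division on the write path is an assumption.

```python
# Hypothetical stand-ins for the DateTimeUtils helpers named above.
NANOS_PER_MICRO = 1_000
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000


def nanos_to_micros(nanos: int) -> int:
    """Write path: scale nanos-of-day down to micros-of-day (floor division assumed)."""
    return nanos // NANOS_PER_MICRO


def micros_to_nanos(micros: int) -> int:
    """Read path: scale micros-of-day back up to nanos-of-day."""
    return micros * NANOS_PER_MICRO


def nanos_of_day(hours: int, minutes: int, seconds: int, micros: int = 0) -> int:
    """Build an internal-style value: nanoseconds since midnight."""
    total_micros = ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + micros
    return total_micros * NANOS_PER_MICRO


internal = nanos_of_day(12, 34, 56, 123_456)   # 12:34:56.123456 as nanos
on_disk = nanos_to_micros(internal)            # value stored under time-micros
restored = micros_to_nanos(on_disk)            # value read back into TimeType
```

For precision 0-6 the internal value is always a whole number of microseconds, so under these assumptions the round trip loses nothing and the stored value stays below `MICROS_PER_DAY`.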
`SchemaConverters` is unchanged: `time-micros` is the correct unit-matching
logical type for precision 0-6, and the `spark.sql.catalyst.type` property
continues to carry precision fidelity for Spark-to-Spark round-trips.
Backport note: the production change applied cleanly. The only cherry-pick
conflict was in `AvroSuite.scala`, because the master change places the TIME
tests next to the `SPARK-57166: nanosecond timestamp types are not supported in
Avro` test, which does not exist on `branch-4.2` (nanosecond timestamp types
are a master/`branch-4.x` feature). Resolved by placing the five TIME tests in
the base `AvroSuite` (so they run under both `AvroV1Suite` and `AvroV2Suite`)
and omitting the unrelated `SPARK-57166` test. The net production/test logic is
identical to master.
### Why are the changes needed?
Any external Avro reader (Hive, Trino, Flink, fastavro, etc.) that honors
the `time-micros` logical type would decode a Spark-written `TIME` column as
microseconds-since-midnight while it actually held nanoseconds-since-midnight.
That is a 1000x error, and for any time of day from 00:01:26.4 onward the
stored value also falls outside the valid micros-of-day range
(0 to 86,399,999,999). The TIME-in-Avro support is present in the 4.2 line
(SPARK-54473), so the bug needs to be fixed before 4.2.0 GA.
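
As a worked illustration of that mismatch, the sketch below shows what a reader honoring `time-micros` would see for a pre-fix value. It is plain Python arithmetic, not an actual Avro reader; the helper names are hypothetical.

```python
from datetime import timedelta

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86_400_000_000


def nanos_since_midnight(hours: int, minutes: int, seconds: int) -> int:
    """Pre-fix on-disk value: raw nanoseconds since midnight."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000_000


def is_valid_time_micros(value: int) -> bool:
    """A time-micros value must lie within a single day."""
    return 0 <= value < MICROS_PER_DAY


# 12:34:56 written before the fix: nanoseconds labeled as time-micros.
stored = nanos_since_midnight(12, 34, 56)

# An external reader trusts the label and treats the number as microseconds.
misread = timedelta(microseconds=stored)

# After the fix the same time is stored as true microseconds.
fixed = stored // 1_000
```

Here `misread` comes out at more than 524 days rather than a time of day, while `fixed` is a valid micros-of-day value. Pre-fix values for times before 00:01:26.4 stay inside the valid range but are still wrong by a factor of 1000.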
### Does this PR introduce _any_ user-facing change?
Yes, within the unreleased 4.2.0 line. The on-disk encoding of a `TIME`
column written via Avro changes from raw nanoseconds (mislabeled as
`time-micros`) to actual microseconds under `time-micros`. Avro files written
by earlier unreleased 4.2 builds are intentionally not migrated (accepted
break). Spark-to-Spark read/write of `TIME` over Avro continues to round-trip
correctly.
### How was this patch tested?
Ran the TIME Avro tests on this `branch-4.2` backport:
`AvroV1Suite`/`AvroV2Suite` ("TIME type read/write with Avro format", "TIME
type in nested structures in Avro", "TIME type precision metadata is preserved
in Avro", "SPARK-57581: TIME is written as unit-correct time-micros for
external readers", "SPARK-57581: TIME read from a plain time-micros Avro file
(no catalyst prop)"): 10 tests in total (5 tests x 2 suites), all passing. `dev/scalastyle` is
clean.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]