MaxGekk opened a new pull request, #56651:
URL: https://github.com/apache/spark/pull/56651

   ### What changes were proposed in this pull request?
   
   This is a backport of #56633 (master commit 
24e0663709335f5a1d8d5d3e0d45658bec3259f2) to `branch-4.2`.
   
   It makes Spark's Avro encoding of the `TIME` data type unit-correct. 
`TimeType` is represented internally as nanoseconds-since-midnight, but the 
Avro path annotated the column with the `time-micros` logical type while 
writing the raw nanosecond value, so the declared unit (microseconds) did not 
match the stored unit (nanoseconds).
   
   The fix converts the value to match the logical type:
   - Write path (`AvroSerializer`): `nanos -> micros` 
(`DateTimeUtils.nanosToMicros`) before writing under `time-micros`.
   - Read path (`AvroDeserializer`): `micros -> nanos` 
(`DateTimeUtils.microsToNanos`) when reading a `time-micros` value into 
`TimeType`.
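
   The two conversions above can be sketched in a few lines. This is a minimal illustration, not the Spark code: `nanos_to_micros` and `micros_to_nanos` are hypothetical Python stand-ins for the `DateTimeUtils` helpers, and the floor division on the write path is an assumption about rounding that the description does not state.

```python
NANOS_PER_MICRO = 1_000
MICROS_PER_DAY = 86_400 * 1_000_000


def nanos_to_micros(nanos: int) -> int:
    """Write path: nanoseconds-of-day -> microseconds-of-day (assumed floor division)."""
    return nanos // NANOS_PER_MICRO


def micros_to_nanos(micros: int) -> int:
    """Read path: microseconds-of-day -> nanoseconds-of-day."""
    return micros * NANOS_PER_MICRO


# 12:34:56.789012 expressed as nanoseconds since midnight.
seconds_of_day = 12 * 3600 + 34 * 60 + 56
nanos = (seconds_of_day * 1_000_000 + 789_012) * NANOS_PER_MICRO

stored = nanos_to_micros(nanos)          # the long written under `time-micros`
assert 0 <= stored < MICROS_PER_DAY      # now inside the valid micros-of-day range
assert micros_to_nanos(stored) == nanos  # micro-precision values round-trip exactly
```

   Because `time-micros` covers precision 0-6, a value with at most six fractional digits loses nothing in the write/read round trip under this sketch.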
   
   `SchemaConverters` is unchanged: `time-micros` is the correct unit-matching 
logical type for precision 0-6, and the `spark.sql.catalyst.type` property 
continues to carry precision fidelity for Spark-to-Spark round-trips.
   
   Backport note: the production change applied cleanly. The only cherry-pick 
conflict was in `AvroSuite.scala`, because the master change places the TIME 
tests next to the `SPARK-57166: nanosecond timestamp types are not supported in 
Avro` test, which does not exist on `branch-4.2` (nanosecond timestamp types 
are a master/`branch-4.x` feature). Resolved by placing the five TIME tests in 
the base `AvroSuite` (so they run under both `AvroV1Suite` and `AvroV2Suite`) 
and omitting the unrelated `SPARK-57166` test. The net production/test logic is 
identical to master.
   
   ### Why are the changes needed?
   
   Any external Avro reader (Hive, Trino, Flink, fastavro, etc.) that honors 
the `time-micros` logical type would decode a Spark-written `TIME` column as 
microseconds-since-midnight while it actually held nanoseconds-since-midnight: 
a 1000x error that, for any time of day later than 00:01:26.4, also falls 
outside the valid micros-of-day range. The TIME-in-Avro support is present in 
the 4.2 line (SPARK-54473), so the bug needs to be fixed before 4.2.0 GA.
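
   To make the mismatch concrete, the sketch below mimics what a reader that trusts the `time-micros` annotation would see. It is a standalone illustration using only the Python standard library; `decode_time_micros` is a hypothetical helper, not the API of any of the readers named above.

```python
from datetime import timedelta

MICROS_PER_DAY = 86_400 * 1_000_000


def decode_time_micros(raw: int) -> timedelta:
    """Interpret a stored long as microseconds since midnight, as `time-micros` declares."""
    return timedelta(microseconds=raw)


# 00:02:00 as nanoseconds since midnight: the value the old write path stored as-is.
raw_nanos = 120 * 1_000_000_000

misread = decode_time_micros(raw_nanos)           # old encoding: nanos read as micros
correct = decode_time_micros(raw_nanos // 1_000)  # fixed encoding: real micros

assert misread == correct * 1_000    # the 1000x error
assert raw_nanos >= MICROS_PER_DAY   # the raw value is not a valid micros-of-day
assert correct == timedelta(minutes=2)
```

   In this example a two-minute offset from midnight decodes as more than a full day, so a strict reader could reject the value rather than merely misreport it.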
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, within the unreleased 4.2.0 line. The on-disk encoding of a `TIME` 
column written via Avro changes from raw nanoseconds (mislabeled as 
`time-micros`) to actual microseconds under `time-micros`. Avro files written 
by earlier unreleased 4.2 builds are intentionally not migrated (accepted 
break). Spark-to-Spark read/write of `TIME` over Avro continues to round-trip 
correctly.
   
   ### How was this patch tested?
   
   Ran the TIME Avro tests on this `branch-4.2` backport: 
`AvroV1Suite`/`AvroV2Suite` ("TIME type read/write with Avro format", "TIME 
type in nested structures in Avro", "TIME type precision metadata is preserved 
in Avro", "SPARK-57581: TIME is written as unit-correct time-micros for 
external readers", "SPARK-57581: TIME read from a plain time-micros Avro file 
(no catalyst prop)") - 10 tests (5 x V1/V2), all pass. `dev/scalastyle` is 
clean.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

