[PR] [SPARK-57581][SQL] Encode the TIME data type in Avro with a unit-correct logical type [spark]

via GitHub Sat, 20 Jun 2026 03:04:32 -0700


MaxGekk opened a new pull request, #56633:
URL: https://github.com/apache/spark/pull/56633


   ### What changes were proposed in this pull request?
   
   This PR makes Spark's Avro encoding of the `TIME` data type unit-correct.
   
   `TimeType` is represented internally as nanoseconds-since-midnight, but the 
Avro path annotated the column with the `time-micros` logical type while 
writing the raw nanosecond value. The declared unit (microseconds) did not 
match the stored unit (nanoseconds).
   
   The fix converts the value to match the logical type:
   - Write path (`AvroSerializer`): `nanos -> micros` 
(`DateTimeUtils.nanosToMicros`) before writing under `time-micros`.
   - Read path (`AvroDeserializer`): `micros -> nanos` 
(`DateTimeUtils.microsToNanos`) when reading a `time-micros` value into 
`TimeType`.
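
As a rough illustration of the two conversions listed above, here is a minimal Python sketch. The function names are hypothetical stand-ins for `DateTimeUtils.nanosToMicros` and `DateTimeUtils.microsToNanos`, not Spark's code, and the use of floor division on the write path is an assumption.

```python
NANOS_PER_MICRO = 1_000
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000


def nanos_to_micros(nanos: int) -> int:
    """Write path: nanoseconds-since-midnight -> microseconds-since-midnight."""
    # Assumption: floor division. Time-of-day values are non-negative,
    # so floor and truncation give the same result here.
    return nanos // NANOS_PER_MICRO


def micros_to_nanos(micros: int) -> int:
    """Read path: microseconds-since-midnight -> nanoseconds-since-midnight."""
    return micros * NANOS_PER_MICRO


# 01:02:03.456789 as nanoseconds-since-midnight. At precision 6 the value
# is a whole number of microseconds, so the conversion loses nothing.
nanos = ((1 * 3600 + 2 * 60 + 3) * 1_000_000 + 456_789) * NANOS_PER_MICRO
stored = nanos_to_micros(nanos)       # value written under time-micros
restored = micros_to_nanos(stored)    # value handed back to TimeType
```

Because precision 0-6 values are always multiples of 1000 nanoseconds, `restored` equals `nanos` in this sketch, which is consistent with the round-trip claim in this description.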
   
   `SchemaConverters` is unchanged: `time-micros` is now the correct 
unit-matching logical type for precision 0-6, and the `spark.sql.catalyst.type` 
property continues to carry precision fidelity for Spark-to-Spark round-trips.
   
   Scope is precision 0-6 (`TimeType.MAX_PRECISION`). Avro 1.12 exposes no 
`time-nanos` logical type, and precision 7-9 is not constructible yet, so 
nanosecond-precision encoding is left to a follow-up.
   
   ### Why are the changes needed?
   
   Any external Avro reader (Hive, Trino, Flink, fastavro, etc.) that honors 
the `time-micros` logical type would decode a Spark-written `TIME` column as 
microseconds-since-midnight while it actually held nanoseconds-since-midnight: 
a 1000x error that also pushes any time at or after 00:01:26.4 outside the 
valid micros-of-day range. This affected all precisions. For comparison, the 
Parquet path is already unit-correct (SPARK-57551).
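
The size of the mismatch can be worked through with plain arithmetic. The sketch below uses only the Python standard library; the helper names are hypothetical and nothing here touches Spark or Avro.

```python
from datetime import time

MICROS_PER_DAY = 86_400_000_000  # upper bound (exclusive) for time-micros


def micros_of_day(t: time) -> int:
    """Microseconds since midnight for a wall-clock time."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def in_micros_range(value: int) -> bool:
    """True if value is a legal microseconds-of-day number."""
    return 0 <= value < MICROS_PER_DAY


t = time(12, 34, 56, 123456)
correct = micros_of_day(t)     # what a time-micros reader expects
mislabeled = correct * 1_000   # raw nanoseconds stored under time-micros

# The raw nanosecond value reaches MICROS_PER_DAY at 86.4 seconds
# past midnight, so only the first ~86 seconds of the day stay in range.
last_in_range = micros_of_day(time(0, 1, 26, 399_999)) * 1_000
first_out_of_range = micros_of_day(time(0, 1, 26, 400_000)) * 1_000
```

In this sketch `in_micros_range(correct)` holds while `in_micros_range(mislabeled)` does not, so a reader that validates the range could reject the value rather than silently return a wrong time.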
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only within the unreleased `master`/`branch-4.x` line where 
TIME-in-Avro was recently introduced; no released Spark version is affected. 
The on-disk encoding of a `TIME` column written via Avro changes from raw 
nanoseconds (mislabeled as `time-micros`) to actual microseconds under 
`time-micros`. Avro files written by earlier unreleased builds are 
intentionally not migrated (accepted break). Spark-to-Spark read/write of 
`TIME` over Avro continues to round-trip correctly.
   
   ### How was this patch tested?
   
   - Added a test in `AvroSuite` that writes a `TIME(p)` value for each 
precision 0-6, decodes the file with a plain Avro 
`DataFileReader`/`GenericDatumReader` (no Spark), and asserts the stored long 
equals the expected micros-of-day, is within the valid micros-of-day range, and 
is annotated with the `time-micros` logical type.
   - Verified the existing TIME round-trip tests still pass: 
`AvroV1Suite`/`AvroV2Suite` ("TIME type read/write with Avro format", "TIME 
type in nested structures in Avro", "TIME type precision metadata is preserved 
in Avro") and the `to_avro`/`from_avro` TIME tests in `AvroFunctionsSuite`.
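
The three assertions described in the first bullet can be sketched independently of Spark and of any Avro library. This is not the `AvroSuite` test itself: the field name `t`, the helper name, and the non-nullable field shape are assumptions, and the schema fragment is hand-written to follow the Avro specification's `time-micros` annotation on a `long`.

```python
import json

MICROS_PER_DAY = 86_400_000_000

# Hypothetical field schema: a long annotated with time-micros.
FIELD_SCHEMA = json.loads(
    '{"name": "t", "type": {"type": "long", "logicalType": "time-micros"}}'
)


def check_time_field(field_schema: dict, stored: int, expected_micros: int) -> None:
    """Assert the logical type, the stored value, and the valid range."""
    avro_type = field_schema["type"]
    assert avro_type["type"] == "long"
    assert avro_type["logicalType"] == "time-micros"
    assert stored == expected_micros
    assert 0 <= stored < MICROS_PER_DAY


# One sample per precision 0-6: 12:34:56 plus one unit of that precision
# (for precision 0 the unit is a whole second), expressed in microseconds.
samples = {p: 45_296 * 1_000_000 + 10 ** (6 - p) for p in range(7)}
for precision, expected in samples.items():
    check_time_field(FIELD_SCHEMA, stored=expected, expected_micros=expected)
```

Passing the pre-fix raw nanosecond value as `stored` would fail both the equality and the range assertion, which is the regression such a test is meant to catch.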
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


