MaxGekk opened a new pull request, #56633:
URL: https://github.com/apache/spark/pull/56633
### What changes were proposed in this pull request?
This PR makes Spark's Avro encoding of the `TIME` data type unit-correct.
`TimeType` is represented internally as nanoseconds-since-midnight, but the
Avro path annotated the column with the `time-micros` logical type while
writing the raw nanosecond value. The declared unit (microseconds) did not
match the stored unit (nanoseconds).
The fix converts the value to match the logical type:
- Write path (`AvroSerializer`): `nanos -> micros`
(`DateTimeUtils.nanosToMicros`) before writing under `time-micros`.
- Read path (`AvroDeserializer`): `micros -> nanos`
(`DateTimeUtils.microsToNanos`) when reading a `time-micros` value into
`TimeType`.
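
The two conversions can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Spark code: `nanos_to_micros` and `micros_to_nanos` are hypothetical stand-ins for `DateTimeUtils.nanosToMicros` and `DateTimeUtils.microsToNanos`, and floor division is an assumption about how the write path rounds.

```python
NANOS_PER_MICRO = 1_000

def nanos_to_micros(nanos: int) -> int:
    # Write path: scale the internal nanos-of-day down to micros-of-day.
    # Floor division is assumed here; the real helper may round differently.
    return nanos // NANOS_PER_MICRO

def micros_to_nanos(micros: int) -> int:
    # Read path: scale the stored micros-of-day back up to nanos-of-day.
    return micros * NANOS_PER_MICRO

# 12:34:56.123456 expressed as nanoseconds since midnight.
seconds_of_day = 12 * 3600 + 34 * 60 + 56
time_nanos = (seconds_of_day * 1_000_000 + 123_456) * NANOS_PER_MICRO

stored = nanos_to_micros(time_nanos)   # value written under time-micros
restored = micros_to_nanos(stored)     # value handed back to TimeType
```

Because precision is capped at 6 fractional digits, the nanosecond value is always a multiple of 1000 in this sketch, so the round trip loses nothing.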
`SchemaConverters` is unchanged: `time-micros` is now the correct
unit-matching logical type for precision 0-6, and the `spark.sql.catalyst.type`
property continues to carry precision fidelity for Spark-to-Spark round-trips.
The scope is precision 0-6 (up to `TimeType.MAX_PRECISION`). Avro 1.12 exposes
no `time-nanos` logical type and precision 7-9 is not constructible yet, so
support for precision 7-9 is left to a follow-up.
### Why are the changes needed?
Any external Avro reader (Hive, Trino, Flink, fastavro, etc.) that honors
the `time-micros` logical type would decode a Spark-written `TIME` column as
microseconds-since-midnight while it actually held nanoseconds-since-midnight:
a 1000x error that, for any time of day from 00:01:26.4 onward, also falls
outside the valid micros-of-day range. This affected all precisions. For
comparison, the Parquet path is already unit-correct (SPARK-57551).
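
The size of the error can be worked through with plain arithmetic. The sketch below is illustrative only; `misread_as_micros` is a hypothetical helper that models an external reader interpreting the old raw nanosecond value as micros-of-day.

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86_400_000_000

def misread_as_micros(nanos_of_day: int):
    # An external reader treats the stored long as microseconds, so the
    # number of seconds it reports is the stored value divided by 1e6.
    reported_seconds = nanos_of_day // 1_000_000
    # The stored long is a valid micros-of-day only if it is below one day.
    in_range = 0 <= nanos_of_day < MICROS_PER_DAY
    return reported_seconds, in_range

one_second = 1 * 1_000_000_000          # 00:00:01 as nanos-of-day
noon = 12 * 3600 * 1_000_000_000        # 12:00:00 as nanos-of-day

early = misread_as_micros(one_second)   # 1000 seconds, i.e. read as 00:16:40
midday = misread_as_micros(noon)        # 43_200_000 seconds, i.e. 500 days
```

The first case stays inside the valid range but is still wrong by a factor of 1000; the second is not a valid time of day at all.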
### Does this PR introduce _any_ user-facing change?
Yes, but only within the unreleased `master`/`branch-4.x` line where
TIME-in-Avro was recently introduced; no released Spark version is affected.
The on-disk encoding of a `TIME` column written via Avro changes from raw
nanoseconds (mislabeled as `time-micros`) to actual microseconds under
`time-micros`. Avro files written by earlier unreleased builds are
intentionally not migrated (accepted break). Spark-to-Spark read/write of
`TIME` over Avro continues to round-trip correctly.
### How was this patch tested?
- Added a test in `AvroSuite` that writes a `TIME(p)` value for each
precision 0-6, decodes the file with a plain Avro
`DataFileReader`/`GenericDatumReader` (no Spark), and asserts the stored long
equals the expected micros-of-day, is within the valid micros-of-day range, and
is annotated with the `time-micros` logical type.
- Verified the existing TIME round-trip tests still pass:
`AvroV1Suite`/`AvroV2Suite` ("TIME type read/write with Avro format", "TIME
type in nested structures in Avro", "TIME type precision metadata is preserved
in Avro") and the `to_avro`/`from_avro` TIME tests in `AvroFunctionsSuite`.
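
The three properties asserted by the new `AvroSuite` test (expected micros-of-day, valid range, `time-micros` annotation) can be sketched as a standalone check. This is not the test itself: `time_micros_problems` is a hypothetical helper, and the schema is modeled as a plain dictionary in the shape the Avro specification uses for an annotated `long`.

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def time_micros_problems(type_schema: dict, stored: int, expected_micros: int):
    # Collect every violated property instead of stopping at the first one.
    problems = []
    if (type_schema.get("type") != "long"
            or type_schema.get("logicalType") != "time-micros"):
        problems.append("not annotated as time-micros")
    if stored != expected_micros:
        problems.append("stored value is not the expected micros-of-day")
    if not 0 <= stored < MICROS_PER_DAY:
        problems.append("stored value is outside the micros-of-day range")
    return problems

schema = {"type": "long", "logicalType": "time-micros"}
expected = (12 * 3600 + 34 * 60 + 56) * 1_000_000 + 123_456  # 12:34:56.123456

fixed = time_micros_problems(schema, expected, expected)           # new encoding
broken = time_micros_problems(schema, expected * 1_000, expected)  # raw nanos
```

With the corrected encoding the list is empty; feeding it the old raw nanosecond value reports both a value mismatch and a range violation.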
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)