MaxGekk opened a new pull request, #56778:
URL: https://github.com/apache/spark/pull/56778

   ### What changes were proposed in this pull request?
   This PR carries the `TimeType(p)` fractional-second precision `p` (in `[0, 9]`)
across the Spark <-> Arrow type mapping so that a `TIME(p)` column 
round-trips back to the same `TIME(p)`, instead of collapsing to the canonical 
`TIME(6)`.
   
   Arrow's `Time` logical type encodes only `(unit, bitWidth)` and has no 
fractional-precision field, so the precision cannot live in the `ArrowType` 
itself. It is instead carried in the Arrow field metadata under a dedicated key 
`SPARK::time::precision`, reusing the precision-in-field-metadata pattern 
introduced for the nanosecond timestamp types (SPARK-57159).
   
   - `ArrowUtils.toArrowField`: tag `TimeType(p)` fields with the precision 
metadata key, merged with the column metadata. The Arrow type stays 
`Time(NANOSECOND, 64)`.
   - `ArrowUtils.fromArrowField`: read that key to reconstruct `TimeType(p)`; 
when the key is absent (foreign Arrow data) or out of `[0, 9]`, fall back to 
the canonical `TimeType(MICROS_PRECISION)` (= 6) via `fromArrowType`, 
preserving today's behavior for non-Spark producers.
   - The shared precision-stashing helper `toTimestampNanosArrowField` is 
generalized to `toPrecisionTaggedArrowField`, parameterized by the metadata 
key, so the nanosecond timestamp types and `TIME` share it.
   
   `TimeTypeApiOps.toArrowType` and `TypeApiOps.fromArrowType` are unchanged: 
`toArrowType` keeps producing `Time(NANOSECOND, 64)`, and the metadata-less 
`fromArrowType` remains the canonical `TIME(6)` fallback.
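
   The tag-and-recover flow described above can be modelled in a few lines. This is a minimal sketch in plain Python, not the Scala code in `ArrowUtils`: `ArrowFieldModel`, `to_arrow_field`, and `from_arrow_field` are hypothetical stand-ins, and only the metadata key, the `[0, 9]` range, and the `TIME(6)` fallback come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Metadata key named in the PR description.
TIME_PRECISION_KEY = "SPARK::time::precision"
# Canonical fallback precision, i.e. TIME(6).
MICROS_PRECISION = 6
# Stand-in for ArrowType.Time(NANOSECOND, 64); it has no precision slot.
ARROW_TIME_NANOS = ("time", "NANOSECOND", 64)


@dataclass
class ArrowFieldModel:
    """Hypothetical model of an Arrow field: a type plus string metadata."""
    name: str
    arrow_type: tuple
    metadata: Dict[str, str] = field(default_factory=dict)


def to_arrow_field(name: str, precision: int,
                   column_metadata: Optional[Dict[str, str]] = None) -> ArrowFieldModel:
    """Tag the field with the precision, merged with the column metadata."""
    merged = dict(column_metadata or {})
    merged[TIME_PRECISION_KEY] = str(precision)
    # The Arrow type itself is the same for every precision.
    return ArrowFieldModel(name, ARROW_TIME_NANOS, merged)


def from_arrow_field(f: ArrowFieldModel) -> Tuple[int, Dict[str, str]]:
    """Return (precision, column metadata without the precision key)."""
    remaining = {k: v for k, v in f.metadata.items() if k != TIME_PRECISION_KEY}
    raw = f.metadata.get(TIME_PRECISION_KEY)
    try:
        precision = int(raw)
    except (TypeError, ValueError):
        # Key absent (foreign Arrow data) or non-numeric: canonical fallback.
        return MICROS_PRECISION, remaining
    if not 0 <= precision <= 9:
        # Present but out of range: canonical fallback.
        return MICROS_PRECISION, remaining
    return precision, remaining
```

   In this model, `from_arrow_field(to_arrow_field("t", 3, {"comment": "x"}))` yields precision 3 with only the `comment` entry left in the metadata, while a field built without the key yields precision 6.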
   
   ### Why are the changes needed?
   `ArrowUtils` / the Types Framework currently map every `TimeType(p)` to 
`ArrowType.Time(NANOSECOND, 64)` (no precision field), and 
`TypeApiOps.fromArrowType` maps it back to a fixed `TimeType(6)`. As a result 
the declared precision is lost on any Arrow round-trip (`TIME(0)`, `TIME(3)`, 
`TIME(9)`, ... all read back as `TIME(6)`), so Arrow-based schema transfer 
(Connect schema/results, `createDataFrame` from Arrow, `mapInArrow`, etc.) 
silently widens or narrows the type label. The stored value is already 
nanosecond-resolution and is unaffected; this is purely a type-fidelity gap.
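
   The gap can be illustrated with a toy model of the type-only mapping. This is a sketch in plain Python under the assumption that the two functions behave as described in this section; `to_arrow_type` and `from_arrow_type` here are hypothetical stand-ins, not the Spark methods.

```python
MICROS_PRECISION = 6  # canonical TIME(6)


def to_arrow_type(precision: int) -> tuple:
    """Every TIME(p) maps to the same Arrow type; p is dropped here."""
    return ("time", "NANOSECOND", 64)


def from_arrow_type(arrow_type: tuple) -> int:
    """Without metadata, the only available answer is the canonical precision."""
    return MICROS_PRECISION


# Declared precision -> precision seen after a type-only round trip.
round_trip = {p: from_arrow_type(to_arrow_type(p)) for p in (0, 3, 6, 9)}
```

   Every entry of `round_trip` maps to 6, which is the widening (`TIME(0)`, `TIME(3)`) or narrowing (`TIME(9)`) of the type label that this PR removes.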
   
   ### Does this PR introduce _any_ user-facing change?
   Yes (minor): a `TIME(p)` column transferred over Arrow now retains its 
declared precision instead of always reading back as `TIME(6)`. No change to 
stored values.
   
   ### How was this patch tested?
   Added `test("time")` to `ArrowUtilsSuite`: round-trip `TIME(p)` for `p` in 
`{0, 3, 6, 9}` preserves `p` (and the Arrow field stays `Time(NANOSECOND, 64)`);
a `Time(NANOSECOND)` field with no precision metadata, or with a 
present-but-invalid precision (out of `[0, 9]` or non-numeric), falls back to 
`TIME(6)`; and the precision key does not leak into the reconstructed column 
`Metadata`. Run with `build/sbt 'catalyst/testOnly *ArrowUtilsSuite'` (8 tests 
pass).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

