MaxGekk opened a new pull request, #56739:
URL: https://github.com/apache/spark/pull/56739

   ### What changes were proposed in this pull request?
   Wire `TimestampNTZNanosType(p)` and `TimestampLTZNanosType(p)` (`p` in 
`[7, 9]`) into the Spark <-> Arrow type mapping, the `InternalRow` -> Arrow 
vector writer path, and the Arrow -> `InternalRow` accessor path, following 
the existing Types Framework `Ops` pattern.
   
   - `TimestampNanosTypeApiOps`: `toArrowType` maps NTZ -> 
`Timestamp(NANOSECOND, null)` and LTZ -> `Timestamp(NANOSECOND, sessionTz)` 
(with a null-tz guard like `TimestampType`).
   - `TypeApiOps.fromArrowType`: precision-less fallback mapping 
`Timestamp(NANOSECOND, tz/null)` back to LTZ/NTZ nanos at the canonical max 
precision (`9`).
   - `ArrowUtils`: Arrow's `Timestamp` type has no fractional-second precision 
field, so the exact precision is preserved across the round-trip in the Arrow 
field metadata key `SPARK::timestampNanos::precision` (the same channel Spark 
already uses for Geometry/Geography `srid`), recovered in `fromArrowField`, 
falling back to `9` when absent/invalid. Values are carried verbatim at 
nanosecond resolution.
   - `ArrowWriter`: `TimestampNTZNanosWriter` / `TimestampLTZNanosWriter` pack 
the value into int64 epoch-nanoseconds via the shared 
`DateTimeUtils.timestampNanosToEpochNanos`, raising `DATETIME_OVERFLOW` when 
the value is outside the int64 epoch-nanosecond range.
   - `ArrowColumnVector`: accessors for `TimeStampNanoVector` / 
`TimeStampNanoTZVector` decode the epoch-nanoseconds back into 
`TimestampNanosVal`.
   - The epoch-nanos packer and the overflow error are shared with the Parquet 
INT64 path (SPARK-57100): the packer moved into `DateTimeUtils`, and 
`parquetTimestampNanosOverflowError` was generalized to 
`timestampNanosEpochNanosOverflowError(value, isNtz, sink)`. The Parquet error 
message is unchanged (`sink = "Parquet INT64"`); the Arrow path uses 
`sink = "Arrow INT64"`.
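
As a rough illustration of the precision round-trip described for `ArrowUtils`, the following JDK-only Java sketch stores the precision under the metadata key named above and recovers it with a fallback to `9`. It is not the PR's code: the class and method names are hypothetical, a plain `Map<String, String>` stands in for Arrow field metadata, and treating non-numeric or out-of-range values as "invalid" is an assumption.

```java
import java.util.Map;

// Hypothetical sketch; names are illustrative and not taken from Spark.
final class NanosPrecisionMetadataSketch {
    // Metadata key quoted in the PR description.
    static final String PRECISION_KEY = "SPARK::timestampNanos::precision";
    // Precision range [7, 9] as stated in the PR description.
    static final int MIN_PRECISION = 7;
    static final int MAX_PRECISION = 9;

    private NanosPrecisionMetadataSketch() {}

    // Writer side: record the exact precision next to the Arrow field.
    static Map<String, String> toFieldMetadata(int precision) {
        if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
            throw new IllegalArgumentException("precision out of range: " + precision);
        }
        return Map.of(PRECISION_KEY, Integer.toString(precision));
    }

    // Reader side: recover the precision, falling back to 9 when the key is
    // absent or its value is invalid (assumed: non-numeric or outside [7, 9]).
    static int precisionFrom(Map<String, String> metadata) {
        String raw = (metadata == null) ? null : metadata.get(PRECISION_KEY);
        if (raw == null) {
            return MAX_PRECISION;
        }
        try {
            int p = Integer.parseInt(raw.trim());
            return (p >= MIN_PRECISION && p <= MAX_PRECISION) ? p : MAX_PRECISION;
        } catch (NumberFormatException e) {
            return MAX_PRECISION;
        }
    }
}
```

Keeping the fallback on the reader side means Arrow data produced without the key (for example by another Arrow writer) still maps to a valid type at the canonical max precision.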
   
   ### Why are the changes needed?
   This is the shared Arrow prerequisite for Spark Connect (parent: 
SPARK-56822) and also benefits the classic Arrow paths (Arrow result transfer, 
`createDataFrame` from Arrow, `mapInArrow`). The Spark <-> Arrow mapping 
(`ArrowUtils`) and the row-to-vector writers (`ArrowWriter`) had no support for 
the nanosecond timestamp types, so any plan whose schema contained them failed 
to serialize.
   
   ### Does this PR introduce _any_ user-facing change?
   No. The types remain gated behind `spark.sql.timestampNanosTypes.enabled`.
   
   ### How was this patch tested?
   - `ArrowUtilsSuite`: precision round-trip for `p` in `{7, 8, 9}` (NTZ and 
LTZ across multiple session zones), the null-tz LTZ error, the no-metadata 
fallback to `9`, and a check that the precision key does not leak into the 
reconstructed column `Metadata`.
   - `ArrowWriterSuite`: value round-trip (sub-micro `0`/`999`, pre-epoch 
instants, large boundaries, nulls) and `DATETIME_OVERFLOW` for out-of-range 
values, for both NTZ and LTZ.
   - `ParquetTimestampNanosSuite`: re-run to confirm the shared-helper refactor 
preserves the existing Parquet behavior.
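
The value cases listed for `ArrowWriterSuite` (sub-micro `0`/`999`, pre-epoch instants, large boundaries, overflow) can be illustrated with a minimal, JDK-only Java sketch. It assumes, hypothetically, that a nanosecond timestamp is held as epoch microseconds plus a `0..999` nanosecond remainder; the actual layout of `TimestampNanosVal` and the body of `DateTimeUtils.timestampNanosToEpochNanos` are not shown in this description, and the class and method names below are illustrative only. A plain `ArithmeticException` stands in for the `DATETIME_OVERFLOW` error.

```java
import java.math.BigInteger;

// Hypothetical sketch; the (micros, nanosWithinMicro) layout is an assumption.
final class EpochNanosSketch {
    private static final BigInteger NANOS_PER_MICRO = BigInteger.valueOf(1000L);

    private EpochNanosSketch() {}

    // Pack into int64 epoch-nanoseconds, failing when the result does not fit.
    // BigInteger avoids rejecting values near Long.MIN_VALUE, where the
    // intermediate product micros * 1000 alone would overflow a long.
    static long pack(long epochMicros, int nanosWithinMicro) {
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException("nanosWithinMicro must be in [0, 999]");
        }
        BigInteger nanos = BigInteger.valueOf(epochMicros)
            .multiply(NANOS_PER_MICRO)
            .add(BigInteger.valueOf(nanosWithinMicro));
        try {
            return nanos.longValueExact();
        } catch (ArithmeticException e) {
            // Stand-in for the DATETIME_OVERFLOW error raised on the Arrow path.
            throw new ArithmeticException("value outside the int64 epoch-nanosecond range");
        }
    }

    // Decode: floor division keeps the remainder in [0, 999] for pre-epoch instants.
    static long microsOf(long epochNanos) {
        return Math.floorDiv(epochNanos, 1000L);
    }

    static int nanosWithinMicroOf(long epochNanos) {
        return (int) Math.floorMod(epochNanos, 1000L);
    }
}
```

Floor-based decoding is what makes pre-epoch values round-trip: under this layout, one nanosecond before the epoch decodes to microsecond `-1` with remainder `999`, and packing those two parts gives back `-1`.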
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

