MaxGekk opened a new pull request, #56739:
URL: https://github.com/apache/spark/pull/56739
### What changes were proposed in this pull request?
Wire `TimestampNTZNanosType(p)` and `TimestampLTZNanosType(p)` (`p` in `[7,
9]`) into the Spark <-> Arrow type mapping, the `InternalRow` -> Arrow vector
writer path, and the Arrow -> `InternalRow` accessor path, following the
existing Types Framework `Ops` pattern.
- `TimestampNanosTypeApiOps`: `toArrowType` maps NTZ ->
`Timestamp(NANOSECOND, null)` and LTZ -> `Timestamp(NANOSECOND, sessionTz)`
(with a null-tz guard like `TimestampType`).
- `TypeApiOps.fromArrowType`: precision-less fallback mapping
`Timestamp(NANOSECOND, tz/null)` back to LTZ/NTZ nanos at the canonical max
precision (`9`).
- `ArrowUtils`: Arrow's `Timestamp` type has no fractional-second precision
field, so the exact precision is preserved across the round-trip in the Arrow
field metadata key `SPARK::timestampNanos::precision` (the same channel Spark
already uses for Geometry/Geography `srid`), recovered in `fromArrowField`,
falling back to `9` when absent/invalid. Values are carried verbatim at
nanosecond resolution.
- `ArrowWriter`: `TimestampNTZNanosWriter` / `TimestampLTZNanosWriter` pack
the value into int64 epoch-nanoseconds via the shared
`DateTimeUtils.timestampNanosToEpochNanos`, raising `DATETIME_OVERFLOW` when
the value is outside the int64 epoch-nanosecond range.
- `ArrowColumnVector`: accessors for `TimeStampNanoVector` /
`TimeStampNanoTZVector` decode the epoch-nanoseconds back into
`TimestampNanosVal`.
- The epoch-nanos packer and the overflow error are shared with the Parquet
INT64 path (SPARK-57100): the packer moved into `DateTimeUtils`, and
`parquetTimestampNanosOverflowError` was generalized to
`timestampNanosEpochNanosOverflowError(value, isNtz, sink)`. The Parquet error
message is unchanged (`sink = "Parquet INT64"`); the Arrow path uses `sink =
"Arrow INT64"`.
### Why are the changes needed?
This is the shared Arrow prerequisite for Spark Connect (parent:
SPARK-56822) and also benefits the classic Arrow paths (Arrow result transfer,
`createDataFrame` from Arrow, `mapInArrow`). The Spark <-> Arrow mapping
(`ArrowUtils`) and the row-to-vector writers (`ArrowWriter`) had no support for
the nanosecond timestamp types, so any plan whose schema contained them failed
to serialize.
### Does this PR introduce _any_ user-facing change?
No. The types remain gated behind `spark.sql.timestampNanosTypes.enabled`.
### How was this patch tested?
- `ArrowUtilsSuite`: precision round-trip for `p` in `{7, 8, 9}` (NTZ and
LTZ across multiple session zones), null-tz LTZ error, no-metadata fallback to
`9`, and that the precision key does not leak into the reconstructed column
`Metadata`.
- `ArrowWriterSuite`: value round-trip (sub-micro `0`/`999`, pre-epoch
instants, large boundaries, nulls) and `DATETIME_OVERFLOW` for out-of-range
values, for both NTZ and LTZ.
- `ParquetTimestampNanosSuite`: re-run to confirm the shared-helper refactor
preserves the existing Parquet behavior.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]