stevomitric opened a new pull request, #56407:
URL: https://github.com/apache/spark/pull/56407
### What changes were proposed in this pull request?
This PR adds read and write support for the nanosecond-capable timestamp types
`TimestampNTZNanosType(p)` / `TimestampLTZNanosType(p)` (precision `p` in `[7, 9]`, from the SPIP
[SPARK-56822]) in the built-in Parquet data source, gated behind the existing preview flag
`spark.sql.timestampNanosTypes.enabled`.
- Schema conversion (`ParquetSchemaConverter`, both directions):
  - Write: `TimestampLTZNanosType` / `TimestampNTZNanosType` -> `INT64` annotated
    `TIMESTAMP(NANOS, isAdjustedToUTC)` (`isAdjustedToUTC = true` for LTZ, `false` for NTZ).
  - Read: `INT64` + `TIMESTAMP(NANOS, ...)` -> `TimestampLTZNanosType(9)` /
    `TimestampNTZNanosType(9)`. Parquet's `NANOS` unit carries no precision parameter, so reads
    mint the canonical precision 9. The legacy `spark.sql.legacy.parquet.nanosAsLong` path keeps
    precedence and is unchanged.
- Read values (non-vectorized / row-based reader, `ParquetRowConverter`): an `INT64`
  epoch-nanoseconds value is split into `epochMicros = floorDiv(v, 1000)` and
  `nanosWithinMicro = floorMod(v, 1000)` and stored as `TimestampNanosVal`. The LTZ converter
  rebases the microsecond component (reusing the existing datetime rebase); the NTZ converter
  does not (no time-zone context), mirroring the existing microsecond-precision arms.
- Write values (`ParquetWriteSupport`): a `TimestampNanosVal` is written as `INT64`
  epoch-nanoseconds using exact arithmetic
  (`Math.addExact(Math.multiplyExact(epochMicros, 1000), nanosWithinMicro)`); values outside the
  representable `INT64` epoch-nanosecond range (~1677-09-21 .. 2262-04-11) fail instead of
  silently wrapping.
- The Parquet `supportDataType` guards (V1 `ParquetFileFormat` and V2 `ParquetTable`) are relaxed
  to accept the nanos types, and the feature flag is propagated to the read Hadoop configuration
  in both the V1 and V2 paths.
- The nanos types are excluded from `ParquetUtils.isBatchReadSupported`, so columnar reads
  transparently fall back to the row-based reader. Vectorized-reader support is a follow-up.
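
The read-side split and write-side recombination described in the list above can be sketched in plain Java. This is only an illustration of the arithmetic, not the code in this PR: the class `NanosSketch` and its method names are hypothetical, and only the `Math` calls mirror what the description names.

```java
// Illustrative sketch only: NanosSketch and its method names are hypothetical,
// not part of Spark. Only the java.lang.Math calls mirror the PR description.
final class NanosSketch {
    static final long NANOS_PER_MICRO = 1000L;

    // Read side: floor-based split, so the remainder is always in [0, 999].
    static long epochMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    static int nanosWithinMicro(long epochNanos) {
        return (int) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    // Write side: exact arithmetic throws ArithmeticException on overflow
    // instead of silently wrapping around.
    static long toEpochNanos(long epochMicros, int nanosWithinMicro) {
        return Math.addExact(
            Math.multiplyExact(epochMicros, NANOS_PER_MICRO), nanosWithinMicro);
    }

    public static void main(String[] args) {
        long v = 1_700_000_000_123_456_789L;
        long micros = epochMicros(v);
        int nanos = nanosWithinMicro(v);
        System.out.println(micros + " us + " + nanos + " ns");
        System.out.println("round-trips: " + (toEpochNanos(micros, nanos) == v));

        try {
            // One microsecond past the largest value whose nanos fit in INT64.
            toEpochNanos(Long.MAX_VALUE / NANOS_PER_MICRO + 1, 0);
        } catch (ArithmeticException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```

Using the exact-arithmetic variants means an out-of-range microsecond value surfaces as an exception at write time rather than as a corrupted timestamp in the file.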
Spark-written files round-trip the exact type (including precision) via the Spark schema stored in
the Parquet key-value metadata; "foreign" files with no Spark metadata (e.g. produced by
Trino/DuckDB/pandas) derive the nanos type from the Parquet annotation.
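
As a rough model of that resolution rule, the sketch below picks the type recorded in Spark metadata when it exists and otherwise falls back to the annotation with precision 9. Everything here is hypothetical (the class `NanosTypeResolution`, its methods, and the use of plain strings for type names); the real logic lives in `ParquetSchemaConverter` and also involves the feature flag and `nanosAsLong`, which this sketch leaves out.

```java
import java.util.Optional;

// Hypothetical model of the read-side type choice; not Spark's actual API.
// Type names are plain strings purely for illustration.
final class NanosTypeResolution {
    // Parquet's NANOS unit has no precision parameter, so annotation-only
    // files get the canonical precision 9.
    static final int CANONICAL_PRECISION = 9;

    static String fromAnnotation(boolean isAdjustedToUTC) {
        String base = isAdjustedToUTC ? "TimestampLTZNanosType" : "TimestampNTZNanosType";
        return base + "(" + CANONICAL_PRECISION + ")";
    }

    // Assumption: a type recorded in Spark's key-value metadata wins over
    // the type derived from the Parquet annotation.
    static String resolve(Optional<String> typeFromSparkMetadata, boolean isAdjustedToUTC) {
        return typeFromSparkMetadata.orElseGet(() -> fromAnnotation(isAdjustedToUTC));
    }

    public static void main(String[] args) {
        // Spark-written file: the stored precision 7 survives the round-trip.
        System.out.println(resolve(Optional.of("TimestampNTZNanosType(7)"), false));
        // Foreign file: only the annotation is available.
        System.out.println(resolve(Optional.empty(), true));
    }
}
```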
### Why are the changes needed?
Nanosecond-precision timestamps are common in data produced by pandas/PyArrow, Trino, ClickHouse,
DuckDB, and similar systems. Spark currently rejects Parquet `INT64 TIMESTAMP(NANOS)`
(`PARQUET_TYPE_ILLEGAL`), or, with `spark.sql.legacy.parquet.nanosAsLong=true`, reads it as a raw
`LongType` that drops all timestamp and time-zone semantics. This PR lets Spark read and write such
data as first-class nanosecond timestamp types, as part of the SPIP [SPARK-56822] "Timestamps with
nanosecond precision".
### Does this PR introduce _any_ user-facing change?
Yes, behind the preview flag `spark.sql.timestampNanosTypes.enabled` (default off in production).
When the flag is enabled:
- Parquet files with `INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true/false)` are read as
  `TimestampLTZNanosType(9)` / `TimestampNTZNanosType(9)` instead of being rejected.
- Columns of these types can be written to Parquet (as `INT64 TIMESTAMP(NANOS)`).

When the flag is off, behavior is unchanged, including the legacy
`spark.sql.legacy.parquet.nanosAsLong` escape hatch.
### How was this patch tested?
New `ParquetTimestampNanosSuite` covering: Spark write/read round-trip preserving value and
precision at `p` = 7, 8, 9 (vectorized reader on and off); reading "foreign" `TIMESTAMP(NANOS)`
files written directly via parquet-mr for both NTZ and LTZ, including a pre-epoch (negative)
instant that exercises floor semantics and nulls; `nanosAsLong` precedence; the disabled-feature
error; an out-of-`INT64`-range write failing; a nested (array) column round-trip; and a V2
file-source round-trip. Existing tests updated: `SPARK-40819` (pin the feature off to keep
asserting the legacy reject path) and `SPARK-57166` (drop Parquet, which is now supported).
scalastyle passes.
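
To illustrate why the pre-epoch case matters, here is a small standalone comparison of truncating and flooring splits for an instant one nanosecond before the epoch. The class `PreEpochSplit` is a hypothetical illustration and is not taken from the test suite.

```java
import java.util.Arrays;

// Hypothetical illustration; not taken from ParquetTimestampNanosSuite.
final class PreEpochSplit {
    // Truncating division rounds toward zero, so a negative input can yield
    // a negative sub-microsecond remainder.
    static long[] truncatingSplit(long epochNanos) {
        return new long[] { epochNanos / 1000L, epochNanos % 1000L };
    }

    // Flooring division rounds toward negative infinity, keeping the
    // remainder in [0, 999] for every input.
    static long[] flooringSplit(long epochNanos) {
        return new long[] {
            Math.floorDiv(epochNanos, 1000L), Math.floorMod(epochNanos, 1000L)
        };
    }

    public static void main(String[] args) {
        long oneNanoBeforeEpoch = -1L;
        System.out.println(Arrays.toString(truncatingSplit(oneNanoBeforeEpoch))); // [0, -1]
        System.out.println(Arrays.toString(flooringSplit(oneNanoBeforeEpoch)));   // [-1, 999]
    }
}
```

With the flooring split, the value reads as microsecond -1 plus 999 ns, which keeps the nanosecond component non-negative for pre-epoch instants.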
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus 4.8)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]