stevomitric opened a new pull request, #56407:
URL: https://github.com/apache/spark/pull/56407

   ### What changes were proposed in this pull request?
   
   This PR adds read and write support for the nanosecond-capable timestamp types
   `TimestampNTZNanosType(p)` / `TimestampLTZNanosType(p)` (precision `p` in `[7, 9]`, from the SPIP
   [SPARK-56822]) in the built-in Parquet data source, gated behind the existing preview flag
   `spark.sql.timestampNanosTypes.enabled`.
   
   - Schema conversion (`ParquetSchemaConverter`, both directions):
     - Write: `TimestampLTZNanosType` / `TimestampNTZNanosType` -> `INT64` annotated
       `TIMESTAMP(NANOS, isAdjustedToUTC)` (`isAdjustedToUTC = true` for LTZ, `false` for NTZ).
     - Read: `INT64` + `TIMESTAMP(NANOS, ...)` -> `TimestampLTZNanosType(9)` / `TimestampNTZNanosType(9)`.
       Parquet's `NANOS` unit carries no precision parameter, so reads always assign the canonical
       precision 9. The legacy `spark.sql.legacy.parquet.nanosAsLong` path keeps precedence and is
       unchanged.
   - Read values (non-vectorized / row-based reader, `ParquetRowConverter`): an `INT64`
     epoch-nanoseconds value is split into `epochMicros = floorDiv(v, 1000)` and
     `nanosWithinMicro = floorMod(v, 1000)` and stored as `TimestampNanosVal`. The LTZ converter rebases
     the microsecond component (reusing the existing datetime rebase); the NTZ converter does not (no
     time-zone context), mirroring the existing microsecond-precision arms.
   - Write values (`ParquetWriteSupport`): a `TimestampNanosVal` is written as `INT64`
     epoch-nanoseconds using exact arithmetic
     (`Math.addExact(Math.multiplyExact(epochMicros, 1000), nanosWithinMicro)`); values outside the
     representable `INT64` epoch-nanosecond range (~1677-09-21 .. 2262-04-11) fail instead of silently
     wrapping.
   - The Parquet `supportDataType` guards (V1 `ParquetFileFormat` and V2 `ParquetTable`) are relaxed to
     accept the nanos types, and the feature flag is propagated to the read Hadoop configuration in both
     the V1 and V2 paths.
   - The nanos types are excluded from `ParquetUtils.isBatchReadSupported`, so columnar reads
     transparently fall back to the row-based reader. Vectorized-reader support is a follow-up.
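
The read-side split and write-side recombination described in the list above can be sketched in plain Java. This is a minimal illustration, not the code from `ParquetRowConverter` or `ParquetWriteSupport`: the class and method names (`NanosSplitSketch`, `split`, `combine`) are hypothetical, and only the `floorDiv` / `floorMod` / `addExact` / `multiplyExact` arithmetic is taken from the description.

```java
// Hypothetical helper for illustration only; not Spark's actual classes.
final class NanosSplitSketch {

    /** Splits INT64 epoch-nanoseconds into {epochMicros, nanosWithinMicro} with floor semantics. */
    static long[] split(long epochNanos) {
        long epochMicros = Math.floorDiv(epochNanos, 1000L);
        long nanosWithinMicro = Math.floorMod(epochNanos, 1000L); // always in [0, 999]
        return new long[] {epochMicros, nanosWithinMicro};
    }

    /** Recombines into epoch-nanoseconds; throws ArithmeticException instead of wrapping. */
    static long combine(long epochMicros, long nanosWithinMicro) {
        return Math.addExact(Math.multiplyExact(epochMicros, 1000L), nanosWithinMicro);
    }

    public static void main(String[] args) {
        // One nanosecond before the epoch: floor semantics give (-1 micros, 999 nanos).
        long[] parts = split(-1L);
        System.out.println(parts[0] + " micros, " + parts[1] + " nanos");

        // The split is lossless: recombining yields the original value.
        System.out.println(combine(parts[0], parts[1]));

        // A microsecond count whose nanosecond equivalent exceeds INT64 fails loudly.
        try {
            combine(Long.MAX_VALUE / 1000L + 1L, 0L);
        } catch (ArithmeticException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```

Floor semantics keep `nanosWithinMicro` non-negative for pre-epoch instants, which truncating `/` and `%` would not: `-1 % 1000` is `-1` in Java, whereas `Math.floorMod(-1L, 1000L)` is `999`.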
   
   Spark-written files round-trip the exact type (including precision) via the Spark schema stored in the
   Parquet key-value metadata; "foreign" files with no Spark metadata (e.g. produced by
   Trino/DuckDB/pandas) derive the nanos type from the Parquet annotation.
   
   ### Why are the changes needed?
   
   Nanosecond-precision timestamps are common in data produced by pandas/PyArrow, Trino, ClickHouse,
   DuckDB, and similar systems. Spark currently rejects Parquet `INT64 TIMESTAMP(NANOS)`
   (`PARQUET_TYPE_ILLEGAL`), or, with `spark.sql.legacy.parquet.nanosAsLong=true`, reads it as a raw
   `LongType` that drops all timestamp and time-zone semantics. This PR lets Spark read and write such
   data as first-class nanosecond timestamp types, as part of the SPIP [SPARK-56822] "Timestamps with
   nanosecond precision".
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, behind the preview flag `spark.sql.timestampNanosTypes.enabled` (default off in production). When
   the flag is enabled:
   - Parquet files with `INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true/false)` are read as
     `TimestampLTZNanosType(9)` / `TimestampNTZNanosType(9)` instead of being rejected.
   - Columns of these types can be written to Parquet (as `INT64 TIMESTAMP(NANOS)`).
   
   When the flag is off, behavior is unchanged, including the legacy
   `spark.sql.legacy.parquet.nanosAsLong` escape hatch.
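
The read-side behavior under the two settings can be summarized as a small decision function. This is a sketch of the rules as stated in this description, not Spark's `ParquetSchemaConverter` code: the class and method names are hypothetical, and the result types are plain strings standing in for Spark data types.

```java
// Hypothetical decision table for reading INT64 TIMESTAMP(NANOS, isAdjustedToUTC).
final class NanosReadTypeSketch {

    static String resolve(boolean nanosAsLong, boolean nanosTypesEnabled, boolean isAdjustedToUTC) {
        // The legacy spark.sql.legacy.parquet.nanosAsLong path keeps precedence.
        if (nanosAsLong) {
            return "LongType";
        }
        // With the preview flag off, the column is rejected as before.
        if (!nanosTypesEnabled) {
            throw new IllegalStateException("PARQUET_TYPE_ILLEGAL");
        }
        // Parquet's NANOS unit has no precision parameter, so the result is precision 9.
        return isAdjustedToUTC ? "TimestampLTZNanosType(9)" : "TimestampNTZNanosType(9)";
    }

    public static void main(String[] args) {
        System.out.println(resolve(false, true, true));   // LTZ column, flag on
        System.out.println(resolve(false, true, false));  // NTZ column, flag on
        System.out.println(resolve(true, true, true));    // legacy escape hatch wins
    }
}
```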
   
   ### How was this patch tested?
   
   New `ParquetTimestampNanosSuite` covering: Spark write/read round-trip preserving value and precision
   at `p` = 7, 8, 9 (vectorized reader on and off); reading "foreign" `TIMESTAMP(NANOS)` files written
   directly via parquet-mr for both NTZ and LTZ, including a pre-epoch (negative) instant that exercises
   floor semantics and nulls; `nanosAsLong` precedence; the disabled-feature error; an
   out-of-`INT64`-range write failing; a nested (array) column round-trip; and a V2 file-source
   round-trip. Existing tests updated: `SPARK-40819` (pin the feature off to keep asserting the legacy
   reject path) and `SPARK-57166` (drop Parquet, which is now supported). scalastyle passes.
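
The out-of-range case follows from the bounds of a signed 64-bit nanosecond count. A quick way to see where those bounds fall is to convert `Long.MIN_VALUE` and `Long.MAX_VALUE` nanoseconds to calendar dates with the standard `java.time` API. This check is illustrative only and is not taken from `ParquetTimestampNanosSuite`; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper: maps an epoch-nanosecond count to its UTC calendar date.
final class NanosRangeSketch {

    static String utcDate(long epochNanos) {
        // Instant.ofEpochSecond normalizes the nanosecond adjustment with floor semantics,
        // so negative (pre-epoch) counts are handled correctly.
        Instant instant = Instant.ofEpochSecond(0L, epochNanos);
        return instant.atZone(ZoneOffset.UTC).toLocalDate().toString();
    }

    public static void main(String[] args) {
        System.out.println(utcDate(Long.MIN_VALUE)); // 1677-09-21
        System.out.println(utcDate(Long.MAX_VALUE)); // 2262-04-11
        System.out.println(utcDate(-1L));            // 1969-12-31 (one nanosecond before the epoch)
    }
}
```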
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

