Hi all,

I'd like to start discussion on a SPIP for adding nanosecond-precision
timestamp support to Spark SQL.

SPIP doc:
https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?usp=sharing
JIRA: https://issues.apache.org/jira/browse/SPARK-56159

*Motivation*

Spark's timestamp types are microsecond-precision only. Reading nanosecond
Parquet files either throws an AnalysisException or falls back to LongType,
losing all timestamp semantics. Iceberg V3's timestamp_ns / timestamptz_ns
columns are similarly unsupported.


*Proposal*
We propose adding two new singleton types — TimestampNanosType and
TimestampNTZNanosType — extending DatetimeType, stored internally as a Long
representing nanoseconds since the Unix epoch. We also propose TIMESTAMP(p)
parameterized SQL syntax (p = 0–9) that maps to existing singleton types at
parse time.

   - INT64 epoch nanos internal representation, matching Parquet, Arrow,
   Iceberg V3, and Pandas. Trade-off: date range is ~1677–2262 (same as
   Parquet spec).
   - Singleton types, not parameterized case classes. TimestampType remains
   an @Stable singleton — no binary compatibility break.
   - Both LTZ and NTZ variants, required for correct Parquet and Iceberg V3
   interoperability.
   - CHAR/VARCHAR metadata pattern for TIMESTAMP(p) precision enforcement,
   avoiding DecimalType-like complexity.
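
As a quick sanity check on the range trade-off above, the span representable
by INT64 epoch nanoseconds can be computed directly (a standalone Python
sketch, not Spark code):

```python
from datetime import datetime, timedelta, timezone

# INT64 nanoseconds since the Unix epoch: the same layout used by
# Parquet, Arrow, Iceberg V3, and Pandas for nanosecond timestamps.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def int64_nanos_year_bounds():
    # Convert the extreme Long values from nanoseconds to microseconds
    # (Python timedelta resolution) to find the representable years.
    lo = EPOCH + timedelta(microseconds=(-2**63) // 1000)
    hi = EPOCH + timedelta(microseconds=(2**63 - 1) // 1000)
    return lo.year, hi.year

print(int64_nanos_year_bounds())  # (1677, 2262)
```

which matches the ~1677–2262 range in the Parquet spec.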



*Addressing Previous Feedback*
This builds on the previous nanosecond timestamp discussion thread [1].
We've gone through the feedback and concerns raised there; this proposal
addresses each point:


   1. 10-byte storage and non-standard format → We now use INT64 epoch
   nanos — the de facto standard used by Parquet, Arrow, Iceberg, Avro, and
   Pandas. No Arrow compatibility issue, zero conversion overhead.
   2. Precision overlap and type confusion → Replaced the parameterized
   TimestampNsNTZType(precision) (where precision=6 overlapped with existing
   types) with fixed-precision singletons. TIMESTAMP(p) SQL syntax is
   supported but resolved to singletons at parse time (p=0–6 → micros, p=7–9 →
   nanos).
   3. Schema inference unspecified → Fully specified.
   spark.sql.parquet.inferTimestampNanos.enabled controls Parquet
   inference. Avro's timestamp-nanos and Iceberg V3's timestamp_ns map
   directly. Untyped formats (CSV/JSON) default to microseconds for backward
   compatibility.
   4. NTZ-only vs both LTZ and NTZ → Both variants are needed. Parquet
   defines TIMESTAMP(NANOS, true/false) and Iceberg V3 defines both
   timestamptz_ns and timestamp_ns.
   5. Parameterized type complexity → Singletons, not parameterized case
   classes. Precision enforcement uses the proven CHAR/VARCHAR metadata
   pattern.
   6. Casting rules and type coercion undefined → A complete 7×7 cast
   matrix and coercion rules are specified in the doc, following a
   "precision first, then timezone" principle, with a cross-engine
   comparison.


The SPIP doc contains full details including API changes, internal
representation, cast matrix, type coercion rules, TIMESTAMP(p) design, and
a phased milestone plan.
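
To make the "precision first" casting direction concrete, here is a minimal
sketch (my own illustration, not the proposed implementation) of the value
conversion a nanos-to-micros cast implies:

```python
def cast_nanos_to_micros(epoch_nanos: int) -> int:
    # Narrowing TIMESTAMP(9) to TIMESTAMP(6) drops sub-microsecond
    # digits; floor division keeps ordering consistent for negative
    # (pre-epoch) values too.
    return epoch_nanos // 1_000

def cast_micros_to_nanos(epoch_micros: int) -> int:
    # Widening is exact: every microsecond instant has a nanos value.
    return epoch_micros * 1_000

print(cast_nanos_to_micros(1_234_567_891))  # 1234567 (891 ns dropped)
```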

Looking forward to your feedback.

Thanks,
Xiaoxuan Li

[1] https://lists.apache.org/thread/4r25lhtrg9cog956m2fldodt64dgt45j
