Hi all, I'd like to start a discussion on a SPIP for adding nanosecond-precision timestamp support to Spark SQL.
SPIP doc: https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?usp=sharing
JIRA: https://issues.apache.org/jira/browse/SPARK-56159

*Motivation*

Spark's timestamp types are microsecond-precision only. Nanosecond Parquet files either throw AnalysisException or fall back to LongType, losing all timestamp semantics. Iceberg V3's timestamp_ns / timestamptz_ns columns are similarly unsupported.

*Proposal*

We propose adding two new singleton types — TimestampNanosType and TimestampNTZNanosType — extending DatetimeType, stored internally as a Long representing nanoseconds since the Unix epoch. We also propose TIMESTAMP(p) parameterized SQL syntax (p = 0–9) that maps to existing singleton types at parse time.

- INT64 epoch-nanos internal representation, matching Parquet, Arrow, Iceberg V3, and Pandas. Trade-off: the representable date range is ~1677–2262 (same as the Parquet spec).
- Singleton types, not parameterized case classes. TimestampType remains an @Stable singleton — no binary compatibility break.
- Both LTZ and NTZ variants, required for correct Parquet and Iceberg V3 interoperability.
- CHAR/VARCHAR metadata pattern for TIMESTAMP(p) precision enforcement, avoiding DecimalType-like complexity.

*Addressing Previous Feedback*

This builds on the previous nanosecond timestamp discussion thread [1]. We've gone through the feedback and concerns raised there, and this proposal incorporates them:

1. 10-byte storage and non-standard format → We now use INT64 epoch nanos — the de facto standard used by Parquet, Arrow, Iceberg, Avro, and Pandas. No Arrow compatibility issue, zero conversion overhead.

2. Precision overlap and type confusion → Replaced the parameterized TimestampNsNTZType(precision) (where precision=6 overlapped with existing types) with fixed-precision singletons. TIMESTAMP(p) SQL syntax is supported but resolved to singletons at parse time (p=0–6 → micros, p=7–9 → nanos).

3. Schema inference unspecified → Fully specified. spark.sql.parquet.inferTimestampNanos.enabled controls Parquet inference. Avro's timestamp-nanos and Iceberg V3's timestamp_ns map directly. Untyped formats (CSV/JSON) default to microseconds for backward compatibility.

4. NTZ-only vs both LTZ and NTZ → Both variants are needed: Parquet defines TIMESTAMP(NANOS, true/false) and Iceberg V3 defines both timestamptz_ns and timestamp_ns.

5. Parameterized type complexity → Singletons, not parameterized case classes. Precision enforcement uses the proven CHAR/VARCHAR metadata pattern.

6. Casting rules and type coercion undefined → Complete 7×7 cast matrix and coercion rules are specified in the doc, following a "precision first, then timezone" principle, with a cross-engine comparison.

The SPIP doc contains full details, including API changes, the internal representation, the cast matrix, type coercion rules, the TIMESTAMP(p) design, and a phased milestone plan.

Looking forward to your feedback.

Thanks,
Xiaoxuan Li

[1] https://lists.apache.org/thread/4r25lhtrg9cog956m2fldodt64dgt45j
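P.S. To make the parse-time TIMESTAMP(p) resolution concrete, here is a rough sketch of the mapping described above. This is pure Python with illustrative type names standing in for the Scala singletons, not the actual implementation:

```python
# Hypothetical sketch of parse-time TIMESTAMP(p) resolution:
# the parameterized syntax maps to one of four fixed singleton
# types (two existing micros types, two proposed nanos types).
# Precision itself would be carried as column metadata (the
# CHAR/VARCHAR pattern), not as a type parameter.
MICROS_TYPES = {"ltz": "TimestampType", "ntz": "TimestampNTZType"}
NANOS_TYPES = {"ltz": "TimestampNanosType", "ntz": "TimestampNTZNanosType"}

def resolve_timestamp_precision(p: int, tz_variant: str = "ltz") -> str:
    """Map TIMESTAMP(p), p in 0-9, to a singleton type name."""
    if not 0 <= p <= 9:
        raise ValueError(f"TIMESTAMP precision must be 0-9, got {p}")
    table = MICROS_TYPES if p <= 6 else NANOS_TYPES
    return table[tz_variant]

print(resolve_timestamp_precision(6))         # TimestampType (existing)
print(resolve_timestamp_precision(9, "ntz"))  # TimestampNTZNanosType (proposed)
```

Since p=0–6 all resolve to the existing micros singletons, TIMESTAMP(6) columns behave exactly as today; only p=7–9 opt into the new types.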

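And a quick sanity check on the date-range trade-off mentioned under *Proposal*: a signed 64-bit count of nanoseconds since the Unix epoch spans roughly ±292 years, which is where the ~1677–2262 range (shared with Parquet, Arrow, and pandas) comes from:

```python
# Representable range of INT64 epoch nanoseconds, the internal
# representation proposed above.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
I64_MAX = 2**63 - 1

# timedelta's finest unit is microseconds; the truncated
# sub-microsecond remainder doesn't affect the year bounds.
max_ts = EPOCH + timedelta(microseconds=I64_MAX // 1000)
min_ts = EPOCH - timedelta(microseconds=I64_MAX // 1000)
print(min_ts.year, max_ts.year)  # 1677 2262
```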