Hi Xiaoxuan,

Thank you for the detailed clarification of your proposal.

> the key difference is internal representation, our draft uses INT64
> epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).

I think the main difference between our proposals is how we answer the
question: should Spark SQL conform to the SQL standard or not? The standard
says clearly that the year range is from 0001 to 9999.

Rough count of distinct nanosecond instants on a proleptic-Gregorian line
from 0001-01-01 through 9999-12-31:
- About 3.65*10^6 civil days in that span (order of magnitude is enough).
- Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets from
  midnight.

So the number of distinct values is about:

  N ≈ 3.65*10^6 * 8.64*10^13 ≈ 3.2*10^20

Then log2(N) ≈ 68.1, so any mapping from that full set needs at least 69
bits, which is more than the 64 bits of an INT64.

> Four concerns, and I'd value your read on whether they're solvable:
> Composite doesn't fit UnsafeRow's 8-byte slot, so every
> sort/hash/join/shuffle pays the variable-length cost: extra memory access,
> worse cache locality, ~2–3x memory per value.

You are right about UnsafeRow, but built-in datasources like Parquet and ORC
might return column vectors in which the values are stored as arrays of long
and short, and such values can be processed in a vectorized way. I believe
the new data type will perform worse, but not significantly so.

> The range benefit doesn't survive egress. Spark's main egress paths are all
> INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect),
> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].

Below are the sources from which nanosecond-precision timestamps outside the
range 1677-2262 could come:
1. Parquet: Spark's TIMESTAMP_LTZ is still saved to and loaded from INT96 by
   default, which has nanosecond precision.
2. Another built-in datasource, ORC, stores timestamps with nanosecond
   precision, see https://orc.apache.org/specification/ORCv2/
3. Spark SQL can access external DBMSs that support nanosecond precision,
   for instance Oracle, MS SQL Server, Snowflake, Trino, Teradata.

> Nanosecond precision tends to go with modern-measurement data (HFT, traces,
> IoT, logs); wide calendar range tends to go with archival data where milli or
> second precision is enough.

I would imagine that Spark users might need nanosecond timestamps outside
the range 1677-2262 for:
- Simulating physical processes in the future or in the past.
- Migration from other systems.

> Composite is hard to walk back once shipped. The two directions aren't
> symmetric. Starting with INT64 and upgrading to composite later is SQL-layer
> compatible

INT64 epoch-nanos is also a one-way semantic bet in the other direction: once
users store physics-time workloads in that encoding, widening later without
reinterpretation is not free either.

> The other thing that pulled us toward INT64 is that it's the choice most
> open-source columnar and lakehouse engines have already made. DuckDB's
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
> all use INT64 epoch-nanos with the 1678–2262 bound.

I agree that matching the open columnar consensus for wire formats is a
strong default for interchange. I would separate that from the question of
Spark’s in-memory representation.

> Given the perf concern especially, we'd prefer INT64 for now. @Unstable keeps
> the door open to the composite layout later

How about measuring the performance of an MVP on end-to-end benchmarks? We
could address perf concerns later.

Yours faithfully,
Max Gekk

On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]> wrote:
>
> Hi Max,
> Thanks for the writeup. I've been working on a related proposal in parallel —
> SPIP: Support NanoSecond Timestamp Types.
> The user-visible surface overlaps a
> lot (SQL syntax, new catalyst types, Parquet NANOS interop); the key
> difference is internal representation, our draft uses INT64 epoch-nanos,
> yours uses composite (epochMicros, nanosOfMicro).
>
> If we decide to go with composite, I agree your layout is the right one,
> reuses micros-based DateTimeUtils, aligns the calendar range with
> TimestampType, keeps the extra precision as a small bounded correction.
>
> We started with INT64 because we're worried about paying composite's cost
> without getting the real benefit. Four concerns, and I'd value your read on
> whether they're solvable:
>
> Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so every
> sort/hash/join/shuffle pays the variable-length cost: extra memory access,
> worse cache locality, ~2–3x memory per value. Trino is the closest precedent
> — they went composite for TIMESTAMP(p>6) because their ceiling is
> picoseconds, and even so the perf gap between short and long representations
> was significant enough that they added a hive.timestamp-precision toggle so
> users could force high-precision columns back to micros. Our ceiling is
> nanoseconds, so we'd take on Trino's cost without Trino's reason. Curious how
> you see it playing out differently.
> The range benefit doesn't survive egress. Spark's main egress paths are all
> INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect),
> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500 value can
> live in Spark memory under composite but can't leave — it either throws on
> write/fetch or gets silently truncated, depending on how the boundary is
> specified. Curious what you have in mind for the egress side.
> Do workloads actually need both? Nanosecond precision tends to go with
> modern-measurement data (HFT, traces, IoT, logs); wide calendar range tends
> to go with archival data where milli or second precision is enough. We
> haven't found a case where a single column needs both — same assumption
> Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they do
> intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01 for
> "unknown start" — mixed into columns that otherwise hold nanosecond-precise
> timestamps. Your proposal handles this natively; ours asks users to either
> use NULL, pick a sentinel within range. That's a real user-facing ask.
> Curious whether you've seen other patterns, since sentinels alone feel like
> something that could also be addressed at the data-modeling layer.
> Composite is hard to walk back once shipped. The two directions aren't
> symmetric. Starting with INT64 and upgrading to composite later is SQL-layer
> compatible — user queries and declared schemas don't move, the existing
> Parquet files keep meaning the same thing (Spark just reads INT64 nanos into
> composite at the edge), and new writes can carry the wider range once Parquet
> or Arrow grow support. Starting with composite is effectively a one-way
> commitment: the moment users persist year-1500 values into tables, Spark owns
> supporting those values forever, because narrowing the type after the fact
> would be data loss from the user's perspective. So starting narrow preserves
> the option to go wider if the evidence shifts; starting wide locks in the
> cost on day one.
>
> The other thing that pulled us toward INT64 is that it's the choice most
> open-source columnar and lakehouse engines have already made. DuckDB's
> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
> all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg
> V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range
> nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats
> they control end-to-end or are row-based OLTP with different cost structures.
> Trino is the one open-source columnar engine that went wider — it supports
> TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit in INT64, so
> composite was necessary. Even so, the performance penalty is real. For a
> columnar engine like Spark whose data plane runs through Parquet and Arrow,
> matching the open-source columnar consensus seemed like the less surprising
> default.
>
> Given the perf concern especially, we'd prefer INT64 for now. @Unstable keeps
> the door open to the composite layout later — if the ecosystem grows
> full-range nanos, workloads push us there, or we need sub-nanosecond
> precision where INT64 isn't enough.
>
> Would love any thought on this, good to align in a single direction before
> either moves forward.
>
> Thanks,
> Xiaoxuan Li
>
> On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:
>>
>> This new design makes sense to me. So we just add 2 more bytes to store
>> nanosOfMicro, and the rest is the same as the current timestamp types: same
>> value range, but higher precision.
>>
>> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
>>>
>>> Hi Spark devs,
>>>
>>> I’d like to share a proposal for nano-second-capable timestamp support
>>> and ask for your feedback.
>>>
>>> Here is the SPIP:
>>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>>
>>> My proposal uses a logical split representation:
>>> - epochMicros: Long
>>> - nanosOfMicro: Short in [0, 999]
>>>
>>> This applies to both NTZ and LTZ nano-capable types; timezone
>>> semantics remain unchanged and are handled at interpretation
>>> boundaries (as today).
>>>
>>> Why this approach? I believe this is the most practical path for Spark
>>> because it:
>>> 0. Conforms to the SQL standard.
>>> 1. Preserves Spark’s existing microsecond approach. Most
>>> Catalyst/runtime datetime logic already uses micros. The split model
>>> extends it rather than replacing it.
>>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine model. A
>>> single Long epoch-nanos representation constrains calendar range much
>>> more aggressively than Long micros.
>>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>>> default; nano precision is opt-in via parameterized types/syntax.
>>> 4. Allows efficient implementation paths. Internals can still choose
>>> compact physical encodings (row/vector/file boundaries), while keeping
>>> one canonical logical contract.
>>>
>>> Related SPIPs considered. I reviewed and compared against these two drafts:
>>> - SPIP: Support NanoSecond Timestamps:
>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>>> - SPIP: Support NanoSecond Timestamp Types:
>>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>>
>>> Those drafts are valuable and informed this design. The key difference
>>> is that I prioritize micros-first engine continuity with a bounded
>>> nano remainder, instead of making epoch-nanos the primary internal
>>> semantic unit.
>>> In short: I think epochMicros + nanosOfMicro is a better fit for
>>> Spark’s current architecture and compatibility constraints, while
>>> still delivering practical nanosecond support.
>>>
>>> Thanks in advance for your feedback.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
>>>
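
P.S. For anyone who wants to check the arithmetic in this thread, here is a
rough sketch in Python (chosen for brevity; it is not Spark code). It
recomputes the 69-bit estimate for the 0001..9999 range, derives the calendar
window that a signed INT64 epoch-nanos value can cover, and shows the
(epochMicros, nanosOfMicro) split together with an overflow check on the way
back to INT64. All function and constant names below are hypothetical and
exist only for this illustration; the calendar is Python's proleptic
Gregorian one and all instants are taken in UTC.

```python
from datetime import date, datetime, timedelta, timezone

NANOS_PER_DAY = 86_400 * 10**9
NANOS_PER_MICRO = 1_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bits_for_full_range():
    """Bits needed to index every nanosecond from 0001-01-01 through 9999-12-31."""
    days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1
    distinct_values = days * NANOS_PER_DAY
    # The largest index is distinct_values - 1; its bit length is the answer.
    return (distinct_values - 1).bit_length()


def int64_nanos_bounds():
    """First and last instants an INT64 epoch-nanos value can hold.

    Python datetimes only carry microseconds, so the bounds are floored
    to micros for display purposes.
    """
    lo = EPOCH + timedelta(microseconds=INT64_MIN // NANOS_PER_MICRO)
    hi = EPOCH + timedelta(microseconds=INT64_MAX // NANOS_PER_MICRO)
    return lo, hi


def split_nanos(epoch_nanos):
    """INT64 epoch-nanos -> (epochMicros, nanosOfMicro), remainder in [0, 999]."""
    # Floor division keeps the remainder non-negative for pre-1970 instants.
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def join_nanos(epoch_micros, nanos_of_micro):
    """(epochMicros, nanosOfMicro) -> INT64 epoch-nanos, or raise if out of range."""
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError("nanosOfMicro must be in [0, 999]")
    value = epoch_micros * NANOS_PER_MICRO + nanos_of_micro
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError("instant does not fit in INT64 epoch-nanos")
    return value
```

With these helpers, bits_for_full_range() gives 69, int64_nanos_bounds()
gives 1677-09-21 and 2262-04-11 as the first and last dates, and calling
join_nanos on the epochMicros of a year-1500 instant raises OverflowError,
which is the egress situation Xiaoxuan describes. How Spark should behave at
that boundary (fail or truncate) is exactly the open question, so the raise
here is only one possible choice.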
