Re: [DISCUSS] SPIP: Nano-second timestamps: micros + nanos of micro

Wenchen Fan Wed, 13 May 2026 02:40:51 -0700

Sorry, I misclicked the send button, let me finish.

We can throw out-of-range errors if the actual timestamp value does not fit
the Parquet INT64, and we can work with the Parquet and other data
format communities to add support for timestamp nanos with a wider year
range. Before that, we can write a custom struct in Parquet to save this
timestamp nano type.
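
The out-of-range check described above could look roughly like the sketch below. It assumes the split (epochMicros, nanosOfMicro) layout from the SPIP; the function name and error messages are hypothetical and not taken from Spark or Parquet.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1_000


def to_parquet_int64_nanos(epoch_micros: int, nanos_of_micro: int) -> int:
    """Collapse a split timestamp into INT64 epoch-nanos, or raise if it does not fit.

    Hypothetical helper for illustration only.
    """
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError(f"nanos_of_micro must be in [0, 999], got {nanos_of_micro}")
    # Python integers are arbitrary precision, so this cannot overflow silently.
    # A JVM implementation would need overflow-checked multiply and add instead.
    epoch_nanos = epoch_micros * NANOS_PER_MICRO + nanos_of_micro
    if not INT64_MIN <= epoch_nanos <= INT64_MAX:
        raise OverflowError(
            f"timestamp {epoch_micros}us + {nanos_of_micro}ns does not fit INT64 epoch-nanos"
        )
    return epoch_nanos
```

A value such as year 9999 (roughly 2.5 * 10^17 microseconds after the epoch) would take the OverflowError branch, which corresponds to the "throw on write" behavior rather than silent truncation.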


On Wed, May 13, 2026 at 5:38 PM Wenchen Fan <[email protected]> wrote:

> I think the main question is what the requirements are for this new
> timestamp nano type. Personally I think it's better to follow the SQL standard
> and support year range 0000 to 9999. This kills the INT64 option. For data
> sources, we can throw an out-of-range error if the actual timestamp value does
> not fit the Parquet INT64
>
> On Tue, May 12, 2026 at 5:38 PM Max Gekk <[email protected]> wrote:
>
>> Hi Xiaoxuan,
>>
>> Thank you for the detailed clarification of your proposal.
>>
>> > the key difference is internal representation, our draft uses INT64
>> epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>
>> I think the main difference between our proposals is how we answer the
>> question: shall Spark SQL conform to the SQL standard or not? The
>> standard says clearly that the year range is from 0001 to 9999. Rough
>> count of distinct nanosecond instants on a proleptic-Gregorian line
>> from 0001‑01‑01 through 9999‑12‑31:
>> - About 3.65*10^6 civil days in that span (order of magnitude is enough).
>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
>> from midnight.
>> So the number of distinct values is about: N ≈ 3.65*10^6 *
>> 8.64*10^13 ≈ 3.2*10^20
>> Then: log2(N) ≈ 68-69 bits.
>> Any mapping from that full set would need at least about 69 bits.
>>
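The estimate above can be reproduced exactly with a few lines of Python. This is only a check of the arithmetic, using the proleptic Gregorian calendar that Python's `date` type implements; the variable names are illustrative.

```python
import math
from datetime import date

# Civil days from 0001-01-01 through 9999-12-31, both ends inclusive.
days = (date(9999, 12, 31) - date(1, 1, 1)).days + 1

NANOS_PER_DAY = 86_400 * 10**9
distinct_instants = days * NANOS_PER_DAY

# Floating-point estimate of log2(N), which comes out slightly above 68.
approx_log2 = math.log2(distinct_instants)

# Exact number of bits needed to index every instant in the range.
bits_needed = (distinct_instants - 1).bit_length()
```

Here `bits_needed` evaluates to 69, so a single signed or unsigned 64-bit integer cannot enumerate every nanosecond in the standard's year range.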
>> > Four concerns, and I'd value your read on whether they're solvable:
>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every
>> sort/hash/join/shuffle pays the variable-length cost: extra memory access,
>> worse cache locality, ~2–3x memory per value.
>>
>> You are right about UnsafeRows, but built-in datasources like Parquet and
>> ORC might return Column Vectors where values are stored as arrays of
>> long and short, and such values could be processed in vectorized ways. I
>> believe the new data type will have worse performance, but not
>> significantly worse.
>>
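A minimal sketch of the columnar layout mentioned above, assuming two parallel primitive arrays (one of 64-bit longs, one of 16-bit shorts). The function and variable names are hypothetical, and the plain Python loop only stands in for a vectorized scan; it says nothing about actual Spark performance.

```python
from array import array


def select_at_or_after(micros_col, nanos_col, bound_micros, bound_nanos):
    """Return row indices whose (micros, nanos) value is >= the given bound."""
    selected = []
    for i in range(len(micros_col)):
        m = micros_col[i]
        # Compare the micros part first; consult nanos only on a tie.
        if m > bound_micros or (m == bound_micros and nanos_col[i] >= bound_nanos):
            selected.append(i)
    return selected


# Two parallel columns: signed 64-bit micros and signed 16-bit nanos-of-micro.
micros_col = array("q", [-1, 0, 1_000_000, 1_000_000])
nanos_col = array("h", [500, 0, 1, 999])

# Storage per value in this layout, in bytes (8 + 2 on common platforms).
bytes_per_value = micros_col.itemsize + nanos_col.itemsize

hits = select_at_or_after(micros_col, nanos_col, 1_000_000, 500)
```

In this layout most comparisons are decided by the long column alone, and the short column is read only when the microsecond parts are equal.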
>> > The range benefit doesn't survive egress. Spark's main egress paths are
>> all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect),
>> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
>>
>> Below are the sources from which timestamps with nanosecond precision
>> outside the range 1677-2262 could come:
>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved to and loaded from INT96
>> by default, which has nanosecond precision.
>> 2. Another built-in datasource, ORC, stores timestamps with nanosecond
>> precision, see https://orc.apache.org/specification/ORCv2/
>> 3. Spark SQL can access some external DBMSs that support
>> nanosecond precision, for instance Oracle, MS SQL Server, Snowflake,
>> Trino, Teradata.
>>
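For reference, the 1677-2262 window follows directly from the limits of a signed 64-bit count of nanoseconds since 1970-01-01 UTC. The sketch below derives the bounds at microsecond resolution, since Python's `datetime` does not carry nanoseconds; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def int64_nanos_bounds():
    """Return the first and last whole microsecond representable as INT64 epoch-nanos."""
    # Truncate toward zero on both sides so each bound stays inside the INT64 range.
    lo_micros = -((2**63) // 1000)
    hi_micros = (2**63 - 1) // 1000
    return (
        EPOCH + timedelta(microseconds=lo_micros),
        EPOCH + timedelta(microseconds=hi_micros),
    )


lo, hi = int64_nanos_bounds()
```

The lower bound falls on 1677-09-21 and the upper bound on 2262-04-11, so any value read from INT96, ORC, or an external DBMS outside that window has no INT64 epoch-nanos encoding.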
>> > Nanosecond precision tends to go with modern-measurement data (HFT,
>> traces, IoT, logs); wide calendar range tends to go with archival data
>> where milli or second precision is enough.
>>
>> I would imagine that Spark users might need timestamps with nanosecond
>> precision outside the range 1677-2262 for:
>> - Simulating some physical processes in the future or in the past.
>> - Migration from other systems.
>>
>> > Composite is hard to walk back once shipped. The two directions aren't
>> symmetric. Starting with INT64 and upgrading to composite later is
>> SQL-layer compatible
>>
>> INT64 epoch-nanos is also a one-way semantic bet in the other
>> direction: once users store physics-time workloads in that encoding,
>> widening later without reinterpretation is not free either.
>>
>> > The other thing that pulled us toward INT64 is that it's the choice
>> most open-source columnar and lakehouse engines have already made. DuckDB's
>> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
>> all use INT64 epoch-nanos with the 1678–2262 bound.
>>
>> Matching open columnar consensus for wire formats is a strong default
>> for interchange, I agree. I would separate that from the question of
>> Spark’s in-memory representation.
>>
>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable
>> keeps the door open to the composite layout later
>>
>> How about measuring the performance of an MVP on end-to-end benchmarks? We
>> could address perf concerns later.
>>
>> Yours faithfully,
>> Max Gekk
>>
>>
>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li <[email protected]>
>> wrote:
>> >
>> > Hi Max,
>> > Thanks for the writeup. I've been working on a related proposal in
>> parallel — SPIP: Support NanoSecond Timestamp Types. The user-visible
>> surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS
>> interop); the key difference is internal representation, our draft uses
>> INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>> >
>> > If we decide to go with composite, I agree your layout is the right
>> one: it reuses micros-based DateTimeUtils, aligns the calendar range with
>> TimestampType, and keeps the extra precision as a small bounded correction.
>> >
>> > We started with INT64 because we're worried about paying composite's
>> cost without getting the real benefit. Four concerns, and I'd value your
>> read on whether they're solvable:
>> >
>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so
>> every sort/hash/join/shuffle pays the variable-length cost: extra memory
>> access, worse cache locality, ~2–3x memory per value. Trino is the closest
>> precedent — they went composite for TIMESTAMP(p>6) because their ceiling is
>> picoseconds, and even so the perf gap between short and long
>> representations was significant enough that they added a
>> hive.timestamp-precision toggle so users could force high-precision columns
>> back to micros. Our ceiling is nanoseconds, so we'd take on Trino's cost
>> without Trino's reason. Curious how you see it playing out differently.
>> > The range benefit doesn't survive egress. Spark's main egress paths are
>> all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark Connect),
>> Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A year-1500 value can
>> live in Spark memory under composite but can't leave — it either throws on
>> write/fetch or gets silently truncated, depending on how the boundary is
>> specified. Curious what you have in mind for the egress side.
>> > Do workloads actually need both? Nanosecond precision tends to go with
>> modern-measurement data (HFT, traces, IoT, logs); wide calendar range tends
>> to go with archival data where milli or second precision is enough. We
>> haven't found a case where a single column needs both — same assumption
>> Parquet, Arrow, Iceberg, and Pandas seem to make. The one case where they
>> do intersect is sentinel values — 9999-12-31 for "no end date," 0001-01-01
>> for "unknown start" — mixed into columns that otherwise hold
>> nanosecond-precise timestamps. Your proposal handles this natively; ours
>> asks users to either use NULL or pick a sentinel within range. That's a real
>> user-facing ask. Curious whether you've seen other patterns, since
>> sentinels alone feel like something that could also be addressed at the
>> data-modeling layer.
>> > Composite is hard to walk back once shipped. The two directions aren't
>> symmetric. Starting with INT64 and upgrading to composite later is
>> SQL-layer compatible — user queries and declared schemas don't move, the
>> existing Parquet files keep meaning the same thing (Spark just reads INT64
>> nanos into composite at the edge), and new writes can carry the wider range
>> once Parquet or Arrow grow support. Starting with composite is effectively
>> a one-way commitment: the moment users persist year-1500 values into
>> tables, Spark owns supporting those values forever, because narrowing the
>> type after the fact would be data loss from the user's perspective. So
>> starting narrow preserves the option to go wider if the evidence shifts;
>> starting wide locks in the cost on day one.
>> >
>> > The other thing that pulled us toward INT64 is that it's the choice
>> most open-source columnar and lakehouse engines have already made. DuckDB's
>> TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp storage
>> all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, Arrow, Iceberg
>> V3, Avro, and Pandas datetime64[ns] do too. Engines that offer full-range
>> nanos — Snowflake, Oracle, DB2 — either run on proprietary storage formats
>> they control end-to-end or are row-based OLTP with different cost
>> structures. Trino is the one open-source columnar engine that went wider —
>> it supports TIMESTAMP(p) up to picoseconds (p=12), which simply doesn't fit
>> in INT64, so composite was necessary. Even so, the performance penalty is
>> real. For a columnar engine like Spark whose data plane runs through
>> Parquet and Arrow, matching the open-source columnar consensus seemed like
>> the less surprising default.
>> >
>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable
>> keeps the door open to the composite layout later — if the ecosystem grows
>> full-range nanos, workloads push us there, or we need sub-nanosecond
>> precision where INT64 isn't enough.
>> >
>> > Would love any thoughts on this; it would be good to align in a single
>> direction before either moves forward.
>> >
>> > Thanks,
>> > Xiaoxuan Li
>> >
>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan <[email protected]> wrote:
>> >>
>> >> This new design makes sense to me. So we just add 2 more bytes to
>> store nanosOfMicro, and the rest is the same as the current timestamp
>> types: same value range, but higher precision.
>> >>
>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk <[email protected]> wrote:
>> >>>
>> >>> Hi Spark devs,
>> >>>
>> >>> I’d like to share a proposal for nano-second-capable timestamp support
>> >>> and ask for your feedback.
>> >>>
>> >>> Here is the SPIP:
>> >>>
>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>> >>>
>> >>> My proposal uses a logical split representation:
>> >>> - epochMicros: Long
>> >>> - nanosOfMicro: Short in [0, 999]
>> >>>
>> >>> This applies to both NTZ and LTZ nano-capable types; timezone
>> >>> semantics remain unchanged and are handled at interpretation
>> >>> boundaries (as today).
>> >>>
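The split representation above could be modeled as in the sketch below. This is an illustration of the logical contract only, not the SPIP's implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass

NANOS_PER_MICRO = 1_000


@dataclass(frozen=True, order=True)
class NanoTimestamp:
    """Hypothetical model of the logical split: epochMicros plus nanosOfMicro."""

    epoch_micros: int    # plays the role of the Long
    nanos_of_micro: int  # plays the role of the Short in [0, 999]

    def __post_init__(self):
        if not 0 <= self.nanos_of_micro < NANOS_PER_MICRO:
            raise ValueError("nanos_of_micro must be in [0, 999]")

    @classmethod
    def from_epoch_nanos(cls, epoch_nanos: int) -> "NanoTimestamp":
        # divmod floors, so the remainder stays in [0, 999] for pre-1970 values too.
        micros, nanos = divmod(epoch_nanos, NANOS_PER_MICRO)
        return cls(micros, nanos)

    def to_epoch_nanos(self) -> int:
        return self.epoch_micros * NANOS_PER_MICRO + self.nanos_of_micro
```

Because `order=True` compares fields in declaration order, two values sort by `epoch_micros` first and `nanos_of_micro` second, which matches chronological order as long as the remainder is kept in [0, 999].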
>> >>> Why this approach? I believe this is the most practical path for Spark
>> >>> because it:
>> >>> 0. Conforms to the SQL standard.
>> >>> 1. Preserves Spark’s existing microsecond approach. Most
>> >>> Catalyst/runtime datetime logic already uses micros. The split model
>> >>> extends it rather than replacing it.
>> >>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine model. A
>> >>> single Long epoch-nanos representation constrains calendar range much
>> >>> more aggressively than Long micros.
>> >>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>> >>> default; nano precision is opt-in via parameterized types/syntax.
>> >>> 4. Allows efficient implementation paths. Internals can still choose
>> >>> compact physical encodings (row/vector/file boundaries), while keeping
>> >>> one canonical logical contract.
>> >>>
>> >>> Related SPIPs considered. I reviewed and compared against these two
>> drafts:
>> >>> - SPIP: Support NanoSecond Timestamps:
>> >>>
>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>> >>> - SPIP: Support NanoSecond Timestamp Types:
>> >>>
>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>> >>>
>> >>> Those drafts are valuable and informed this design. The key difference
>> >>> is that I prioritize micros-first engine continuity with a bounded
>> >>> nano remainder, instead of making epoch-nanos the primary internal
>> >>> semantic unit.
>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for
>> >>> Spark’s current architecture and compatibility constraints, while
>> >>> still delivering practical nanosecond support.
>> >>>
>> >>> Thanks in advance for your feedback.
>> >>>
>> >>> Best regards,
>> >>> Max Gekk
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: [email protected]
>> >>>
>>
