Fair enough, I was not aware of the Java limitation and resulting 
dependency.

On May 13, 2026, at 10:28 AM, Max Gekk <[email protected]> wrote:

Hi Serge,

> If we agree that any performance (and memory) cliff is going composite and 
> not whether the extra bytes are 2 or 4 bytes, then would it make sense to 
> match Trino? We would:

If we supported picosecond precision, this could cause the following 
issues, IMHO:
1. Spark's datetime stack today is “nanos‑native,” not “picos‑native.”
java.time (Instant, LocalDateTime, ZonedDateTime, Duration, etc.) exposes 
nanoseconds as the finest supported unit in the public model. Supporting p > 9 
in Spark SQL means either rounding away picos at almost every boundary or 
building custom arithmetic, normalization, parsing, and calendar logic for the 
sub‑nano tail. That is a large, long‑lived surface area, with high regression 
risk anywhere we already struggle: LTZ vs NTZ, session time zone, legacy 
rebasing, Julian/Gregorian, pushdown, codegen, etc. So "same cost as going 
composite for nanos" does not imply "picos are free once we went composite."
2. Memory is not only “+2 vs +4 bytes” — it is “+delta bytes * row width * 
shuffle fanout.”
Picos widen rows further than nanos, which increases OOM / GC / shuffle spill 
risk on the same heap and cluster sizes — especially for wide fact tables and 
skewed joins on timestamp keys.
3. Interchange and “federation” still do not become automatic.
Even if Trino is aligned internally, Parquet / Arrow / Pandas / JDBC paths 
overwhelmingly standardize on nanos at best for compact physical encodings.
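
To make the "2 or 4 bytes" point concrete, here is a rough Python sketch of how many bits a sub-microsecond remainder needs at nanosecond versus picosecond precision. The function name, the hypothetical picosOfMicro field, and the rounding to power-of-two byte widths are illustrative assumptions, not anything taken from Spark or Trino.

```python
def remainder_width(units_per_micro):
    """Return (bits, bytes) needed to store a remainder in [0, units_per_micro - 1].

    Bytes are rounded up to a power-of-two width (1, 2, 4, 8). That rounding is
    an assumption that mirrors typical fixed-width integer fields.
    """
    bits = (units_per_micro - 1).bit_length()
    width = 1
    while width * 8 < bits:
        width *= 2
    return bits, width


# nanosOfMicro in [0, 999]
nanos_bits, nanos_bytes = remainder_width(1_000)
# hypothetical picosOfMicro in [0, 999999]
picos_bits, picos_bytes = remainder_width(1_000_000)

print(nanos_bits, nanos_bytes)
print(picos_bits, picos_bytes)
```

Under these assumptions the nanosecond remainder fits in a 2-byte field and the picosecond remainder needs a 4-byte one, which is the per-value delta that gets multiplied across rows and shuffles.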

Best regards,
Max Gekk

On Wed, May 13, 2026 at 4:04 PM serge rielau.com<http://rielau.com/> 
<[email protected]<mailto:[email protected]>> wrote:
>
> A few questions to ponder:
>
> - Are we committed to the SQL Standard, even when it may be tactically 
>   inconvenient?
> - Why did Trino and Db2 go to pico? I can answer for Db2 as I was in the room: 
>   We wanted to build for the future and rip off the band-aid, and there was no 
>   extra design or QA cost. What was Trino’s thinking?
> - In my career I have seen DBMS needs go from milli to micro to nano. Nano will 
>   not be the end of it. While for all intents and purposes “antique” 
>   nanoseconds are too esoteric to sweat about, sticking with int64 will not be 
>   an option for pico.
> - Storage is data at rest. It is “easy” to add another format. Engines like 
>   Spark outlive storage formats, and so do their APIs.
>
> If we agree that any performance (and memory) cliff is going composite and 
> not whether the extra bytes are 2 or 4 bytes, then would it make sense to 
> match Trino? We would:
>
> - Have an actual external benefit outside of the corner case of range
> - Peace of mind for the API for at least a decade, perhaps more (if we go Femto 
>   .. which is free upgrade at 4 bytes)
> - Full compatibility with any federated datasource
> - Standard compliance
>
>
>
>
> On May 13, 2026, at 2:40 AM, Wenchen Fan 
> <[email protected]<mailto:[email protected]>> wrote:
>
> Sorry, I misclicked the send button, let me finish.
>
> We can throw out-of-range errors if the actual timestamp value does not fit 
> the Parquet INT64, and we can work with the Parquet and other data 
> format communities to add support for timestamp nanos with a wider year 
> range. Before that, we can write a custom struct in Parquet to save this 
> timestamp nano type.
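
The out-of-range check described above could look roughly like the following Python sketch. It assumes the composite (epochMicros, nanosOfMicro) layout discussed elsewhere in this thread; the function name and the choice of OverflowError are hypothetical and do not reflect an existing Spark or Parquet API.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1_000


def to_int64_epoch_nanos(epoch_micros, nanos_of_micro):
    """Convert a composite timestamp to INT64 epoch-nanos, or raise if it does not fit.

    Python integers are unbounded, so the range check is done explicitly
    against the signed 64-bit bounds.
    """
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError("nanos_of_micro must be in [0, 999]")
    epoch_nanos = epoch_micros * NANOS_PER_MICRO + nanos_of_micro
    if not INT64_MIN <= epoch_nanos <= INT64_MAX:
        raise OverflowError("timestamp does not fit INT64 epoch-nanos")
    return epoch_nanos


# A value close to the epoch converts cleanly.
print(to_int64_epoch_nanos(1, 500))

# A value roughly 470 years before 1970 is rejected instead of wrapping.
try:
    to_int64_epoch_nanos(-14_900_000_000_000_000, 0)
except OverflowError as err:
    print("rejected:", err)
```

The point of the sketch is only that the failure is explicit at the write boundary rather than a silent truncation.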
>
> On Wed, May 13, 2026 at 5:38 PM Wenchen Fan 
> <[email protected]<mailto:[email protected]>> wrote:
>>
>> I think the main question is what are the requirements for this new 
>> timestamp nano type. Personally I think it's better to follow the SQL standard, 
>> and support year range 0000 to 9999. This kills the INT64 option. For data 
>> sources, we can throw an out-of-range error if the actual timestamp value does 
>> not fit the Parquet INT64
>>
>> On Tue, May 12, 2026 at 5:38 PM Max Gekk 
>> <[email protected]<mailto:[email protected]>> wrote:
>>>
>>> Hi Xiaoxuan,
>>>
>>> Thank you for the detailed clarification of your proposal.
>>>
>>> > the key difference is internal representation, our draft uses INT64 
>>> > epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>>
>>> I think the main difference between our proposals is how we answer the
>>> question: shall Spark SQL conform to the SQL standard or not? The
>>> standard says clearly that the year range is from 0001 to 9999. Rough
>>> count of distinct nanosecond instants on a proleptic-Gregorian line
>>> from 0001‑01‑01 through 9999‑12‑31:
>>> - About 3.65*10^6 civil days in that span (order of magnitude is enough).
>>> - Each day has 86400*10^9 = 8.64*10^13 distinct nanosecond offsets
>>> from midnight.
>>> So the number of distinct values is about: N ≈ 3.65*10^6 *
>>> 8.64*10^13 ≈ 3.2*10^20
>>> Then: log2(N) ≈ 68-69 bits.
>>> Any mapping from that full set would need at least about 69 bits.
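
The estimate above can be checked with a short Python sketch. It uses the standard library's proleptic Gregorian `date` type; the variable names are illustrative, and the INT64 bounds are floored to microseconds because `datetime` does not carry nanoseconds.

```python
from datetime import date, datetime, timedelta

NANOS_PER_DAY = 86_400 * 10**9

# Days from 0001-01-01 through 9999-12-31 inclusive (proleptic Gregorian).
days = date(9999, 12, 31).toordinal() - date(1, 1, 1).toordinal() + 1
distinct_nanos = days * NANOS_PER_DAY

# Bits needed to index that many distinct values.
bits_needed = (distinct_nanos - 1).bit_length()

# For comparison: the calendar span covered by a signed 64-bit epoch-nanos value.
epoch = datetime(1970, 1, 1)
int64_min = epoch + timedelta(microseconds=-(2**63) // 1000)
int64_max = epoch + timedelta(microseconds=(2**63 - 1) // 1000)

print(days, distinct_nanos, bits_needed)
print(int64_min.year, int64_max.year)
```

This gives 3,652,059 days and 69 bits for the full range, while the 64-bit epoch-nanos window runs from 1677 to 2262.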
>>>
>>> > Four concerns, and I'd value your read on whether they're solvable:
>>> > Composite doesn't fit UnsafeRow's 8-byte slot, so every 
>>> > sort/hash/join/shuffle pays the variable-length cost: extra memory 
>>> > access, worse cache locality, ~2–3x memory per value.
>>>
>>> You are right for UnsafeRows but built-in datasources like Parquet and
>>> ORC might return Column Vectors where values are stored as arrays of
>>> long, short. And such values could be processed in vectorized ways. I
>>> believe the new data type will have worse performance, but not
>>> significantly so.
>>>
>>> > The range benefit doesn't survive egress. Spark's main egress paths are 
>>> > all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark 
>>> > Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns].
>>>
>>> Below are sources from which timestamps with nanosecond precision
>>> outside the range 1677-2262 could come:
>>> 1. Parquet: Spark's TIMESTAMP_LTZ is still saved/loaded from INT96 by
>>> default which has nanoseconds precision.
>>> 2. Another built-in datasource ORC stores timestamps with nanosecond
>>> precision, see https://orc.apache.org/specification/ORCv2/
>>> 3. Spark SQL can have access to some external DBMSs that support
>>> nanoseconds precision, for instance Oracle, MS SQL Server, Snowflake,
>>> Trino, Teradata.
>>>
>>> > Nanosecond precision tends to go with modern-measurement data (HFT, 
>>> > traces, IoT, logs); wide calendar range tends to go with archival data 
>>> > where milli or second precision is enough.
>>>
>>> I would imagine that Spark users might need timestamps with nanos
>>> outside the range 1677-2262:
>>> - Simulating some physical processes in the future or in the past.
>>> - Migration from other systems.
>>>
>>> > Composite is hard to walk back once shipped. The two directions aren't 
>>> > symmetric. Starting with INT64 and upgrading to composite later is 
>>> > SQL-layer compatible
>>>
>>> INT64 epoch-nanos is also a one-way semantic bet in the other
>>> direction: once users store physics-time workloads in that encoding,
>>> widening later without reinterpretation is not free either.
>>>
>>> > The other thing that pulled us toward INT64 is that it's the choice most 
>>> > open-source columnar and lakehouse engines have already made. DuckDB's 
>>> > TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp 
>>> > storage all use INT64 epoch-nanos with the 1678–2262 bound.
>>>
>>> Matching open columnar consensus for wire formats is a strong default
>>> for interchange, I agree. I would separate that from the question of
>>> Spark’s in-memory representation.
>>>
>>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable 
>>> > keeps the door open to the composite layout later
>>>
>>> How about measuring the performance of an MVP on end-to-end benchmarks? We
>>> could address perf concerns later.
>>>
>>> Yours faithfully,
>>> Max Gekk
>>>
>>>
>>> On Tue, May 12, 2026 at 1:52 AM Xiaoxuan Li 
>>> <[email protected]<mailto:[email protected]>> wrote:
>>> >
>>> > Hi Max,
>>> > Thanks for the writeup. I've been working on a related proposal in 
>>> > parallel — SPIP: Support NanoSecond Timestamp Types. The user-visible 
>>> > surface overlaps a lot (SQL syntax, new catalyst types, Parquet NANOS 
>>> > interop); the key difference is internal representation, our draft uses 
>>> > INT64 epoch-nanos, yours uses composite (epochMicros, nanosOfMicro).
>>> >
>>> > If we decide to go with composite, I agree your layout is the right one: it 
>>> > reuses micros-based DateTimeUtils, aligns the calendar range with 
>>> > TimestampType, and keeps the extra precision as a small bounded correction.
>>> >
>>> > We started with INT64 because we're worried about paying composite's cost 
>>> > without getting the real benefit. Four concerns, and I'd value your read 
>>> > on whether they're solvable:
>>> >
>>> > Hot-path performance. Composite doesn't fit UnsafeRow's 8-byte slot, so 
>>> > every sort/hash/join/shuffle pays the variable-length cost: extra memory 
>>> > access, worse cache locality, ~2–3x memory per value. Trino is the 
>>> > closest precedent — they went composite for TIMESTAMP(p>6) because their 
>>> > ceiling is picoseconds, and even so the perf gap between short and long 
>>> > representations was significant enough that they added a 
>>> > hive.timestamp-precision toggle so users could force high-precision 
>>> > columns back to micros. Our ceiling is nanoseconds, so we'd take on 
>>> > Trino's cost without Trino's reason. Curious how you see it playing out 
>>> > differently.
>>> > The range benefit doesn't survive egress. Spark's main egress paths are 
>>> > all INT64 epoch-nanos: Parquet NANOS, Arrow (so PySpark and Spark 
>>> > Connect), Iceberg V3 timestamp_ns, Avro, Pandas datetime64[ns]. A 
>>> > year-1500 value can live in Spark memory under composite but can't leave 
>>> > — it either throws on write/fetch or gets silently truncated, depending 
>>> > on how the boundary is specified. Curious what you have in mind for the 
>>> > egress side.
>>> > Do workloads actually need both? Nanosecond precision tends to go with 
>>> > modern-measurement data (HFT, traces, IoT, logs); wide calendar range 
>>> > tends to go with archival data where milli or second precision is enough. 
>>> > We haven't found a case where a single column needs both — same 
>>> > assumption Parquet, Arrow, Iceberg, and Pandas seem to make. The one case 
>>> > where they do intersect is sentinel values — 9999-12-31 for "no end 
>>> > date," 0001-01-01 for "unknown start" — mixed into columns that otherwise 
>>> > hold nanosecond-precise timestamps. Your proposal handles this natively; 
>>> > ours asks users to either use NULL or pick a sentinel within range. That's 
>>> > a real user-facing ask. Curious whether you've seen other patterns, since 
>>> > sentinels alone feel like something that could also be addressed at the 
>>> > data-modeling layer.
>>> > Composite is hard to walk back once shipped. The two directions aren't 
>>> > symmetric. Starting with INT64 and upgrading to composite later is 
>>> > SQL-layer compatible — user queries and declared schemas don't move, the 
>>> > existing Parquet files keep meaning the same thing (Spark just reads 
>>> > INT64 nanos into composite at the edge), and new writes can carry the 
>>> > wider range once Parquet or Arrow grow support. Starting with composite 
>>> > is effectively a one-way commitment: the moment users persist year-1500 
>>> > values into tables, Spark owns supporting those values forever, because 
>>> > narrowing the type after the fact would be data loss from the user's 
>>> > perspective. So starting narrow preserves the option to go wider if the 
>>> > evidence shifts; starting wide locks in the cost on day one.
>>> >
>>> > The other thing that pulled us toward INT64 is that it's the choice most 
>>> > open-source columnar and lakehouse engines have already made. DuckDB's 
>>> > TIMESTAMP_NS, ClickHouse's DateTime64(9), and InfluxDB's timestamp 
>>> > storage all use INT64 epoch-nanos with the 1678–2262 bound. Parquet, 
>>> > Arrow, Iceberg V3, Avro, and Pandas datetime64[ns] do too. Engines that 
>>> > offer full-range nanos — Snowflake, Oracle, DB2 — either run on 
>>> > proprietary storage formats they control end-to-end or are row-based OLTP 
>>> > with different cost structures. Trino is the one open-source columnar 
>>> > engine that went wider — it supports TIMESTAMP(p) up to picoseconds 
>>> > (p=12), which simply doesn't fit in INT64, so composite was necessary. 
>>> > Even so, the performance penalty is real. For a columnar engine like 
>>> > Spark whose data plane runs through Parquet and Arrow, matching the 
>>> > open-source columnar consensus seemed like the less surprising default.
>>> >
>>> > Given the perf concern especially, we'd prefer INT64 for now. @Unstable 
>>> > keeps the door open to the composite layout later — if the ecosystem 
>>> > grows full-range nanos, workloads push us there, or we need 
>>> > sub-nanosecond precision where INT64 isn't enough.
>>> >
>>> > Would love any thought on this, good to align in a single direction 
>>> > before either moves forward.
>>> >
>>> > Thanks,
>>> > Xiaoxuan Li
>>> >
>>> > On Fri, May 8, 2026 at 1:43 AM Wenchen Fan 
>>> > <[email protected]<mailto:[email protected]>> wrote:
>>> >>
>>> >> This new design makes sense to me. So we just add 2 more bytes to store 
>>> >> nanosOfMicro, and the rest is the same as the current timestamp types: 
>>> >> same value range, but higher precision.
>>> >>
>>> >> On Thu, May 7, 2026 at 5:16 PM Max Gekk 
>>> >> <[email protected]<mailto:[email protected]>> wrote:
>>> >>>
>>> >>> Hi Spark devs,
>>> >>>
>>> >>> I’d like to share a proposal for nanosecond-capable timestamp support
>>> >>> and ask for your feedback.
>>> >>>
>>> >>> Here is the SPIP:
>>> >>> https://docs.google.com/document/d/1DeW15QueI4PdRyPm6C6jsTZFmIjbXX2j4h-Ja5W_fsg/edit?usp=sharing
>>> >>>
>>> >>> My proposal uses a logical split representation:
>>> >>> - epochMicros: Long
>>> >>> - nanosOfMicro: Short in [0, 999]
>>> >>>
>>> >>> This applies to both NTZ and LTZ nano-capable types; timezone
>>> >>> semantics remain unchanged and are handled at interpretation
>>> >>> boundaries (as today).
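
As a minimal Python sketch of the split representation described above: the helper names are hypothetical, and the sketch only illustrates the arithmetic, not Spark's actual internal classes. It assumes floor division so that the remainder stays in [0, 999] for instants before 1970 as well.

```python
NANOS_PER_MICRO = 1_000


def split_epoch_nanos(epoch_nanos):
    """Split epoch nanoseconds into (epoch_micros, nanos_of_micro).

    divmod floors toward negative infinity, which keeps nanos_of_micro in
    [0, 999] even when epoch_nanos is negative (instants before 1970).
    """
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def join_epoch_nanos(epoch_micros, nanos_of_micro):
    """Inverse of split_epoch_nanos; validates the bounded remainder."""
    if not 0 <= nanos_of_micro < NANOS_PER_MICRO:
        raise ValueError("nanos_of_micro must be in [0, 999]")
    return epoch_micros * NANOS_PER_MICRO + nanos_of_micro


before_epoch = split_epoch_nanos(-1)     # 1 ns before 1970-01-01T00:00:00
after_epoch = split_epoch_nanos(1_500)   # 1.5 microseconds after the epoch

print(before_epoch, after_epoch)
# Tuple comparison orders the pairs chronologically: micros first, then the remainder.
print(before_epoch < after_epoch)
```

Because the remainder is always non-negative, comparing (epoch_micros, nanos_of_micro) pairs lexicographically matches chronological order, so existing micros-based logic only needs a tie-breaker on the second field.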
>>> >>>
>>> >>> Why this approach? I believe this is the most practical path for Spark
>>> >>> because it:
>>> >>> 0. Conforms to the SQL standard.
>>> >>> 1. Preserves Spark’s existing microsecond approach. Most
>>> >>> Catalyst/runtime datetime logic already uses micros. The split model
>>> >>> extends it rather than replacing it.
>>> >>> 2. Avoids INT64 epoch-nanos range cliff as the primary engine model. A
>>> >>> single Long epoch-nanos representation constrains calendar range much
>>> >>> more aggressively than Long micros.
>>> >>> 3. Keeps migration risk lower. Existing microsecond behavior remains
>>> >>> default; nano precision is opt-in via parameterized types/syntax.
>>> >>> 4. Allows efficient implementation paths. Internals can still choose
>>> >>> compact physical encodings (row/vector/file boundaries), while keeping
>>> >>> one canonical logical contract.
>>> >>>
>>> >>> Related SPIPs considered. I reviewed and compared against these two 
>>> >>> drafts:
>>> >>> - SPIP: Support NanoSecond Timestamps:
>>> >>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?tab=t.0#heading=h.4kibaxwtx2xo
>>> >>> - SPIP: Support NanoSecond Timestamp Types:
>>> >>> https://docs.google.com/document/d/1Q5u1whAO_KcT6d4dFFaIMy_S3RoQEo4Znwz2U-nbhls/edit?tab=t.0#heading=h.xk16mmomv6il
>>> >>>
>>> >>> Those drafts are valuable and informed this design. The key difference
>>> >>> is that I prioritize micros-first engine continuity with a bounded
>>> >>> nano remainder, instead of making epoch-nanos the primary internal
>>> >>> semantic unit.
>>> >>> In short: I think epochMicros + nanosOfMicro is a better fit for
>>> >>> Spark’s current architecture and compatibility constraints, while
>>> >>> still delivering practical nanosecond support.
>>> >>>
>>> >>> Thanks in advance for your feedback.
>>> >>>
>>> >>> Best regards,
>>> >>> Max Gekk
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe e-mail: 
>>> >>> [email protected]<mailto:[email protected]>
>>> >>>
>>>
>
