[PR] [SPARK-57735][SQL] Support nanosecond-precision timestamp types in the in-memory columnar cache [spark]

via GitHub Sat, 27 Jun 2026 20:59:34 -0700


viirya opened a new pull request, #56842:
URL: https://github.com/apache/spark/pull/56842


   ### What changes were proposed in this pull request?
   
   The default in-memory columnar cache serializer 
(`DefaultCachedBatchSerializer`) did not support `TimestampNTZNanosType` / 
`TimestampLTZNanosType`. Caching a DataFrame with such a column failed at 
materialization with `not support type: TimestampNTZNanosType(9)`, because none 
of the cache's type-dispatch sites had a case for them.
   
   This adds full support, following the fixed-width multi-field pattern 
already used by `CalendarInterval`. The physical value `TimestampNanosVal` is a 
fixed 16-byte payload (an 8-byte `epochMicros` plus an 8-byte word holding 
`nanosWithinMicro`), so it maps cleanly onto that pattern:
   
   - **`ColumnType`**: a `TIMESTAMP_NANOS` column type (with 
`TIMESTAMP_NTZ_NANOS` / `TIMESTAMP_LTZ_NANOS` singletons) whose 
`append`/`extract` read and write the 16-byte payload, with a 
`MutableUnsafeRow` direct-copy fast path.
   - **`ColumnBuilder`, `ColumnAccessor`**: builder and accessor classes plus 
dispatch cases.
   - **`ColumnStats`**: a `TimestampNanosColumnStats` collector (fixed size, no 
min/max bounds).
   - **`GenerateColumnAccessor`**: the codegen accessor-class selection and 
initialization branch.
   
   `TIMESTAMP_NTZ` and `TIMESTAMP_LTZ` nanos types share the same storage and 
differ only by physical type and row getter/setter, so the encode/decode logic 
is shared between them.
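   
   As a rough illustration of the 16-byte layout described above, here is a minimal, self-contained sketch of an append/extract pair over a plain `java.nio.ByteBuffer`. It is not the code in this PR: `NanosValue`, `NanosPayload`, and `NanosPayloadDemo` are hypothetical names, the `Int` field type and the `[0, 1000)` range check on `nanosWithinMicro` are assumptions, and the real `ColumnType` works against Spark's own row and buffer abstractions.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for the physical value; not Spark's TimestampNanosVal.
final case class NanosValue(epochMicros: Long, nanosWithinMicro: Int)

object NanosPayload {
  // Fixed width: 8 bytes of epochMicros plus an 8-byte word for nanosWithinMicro.
  val Size: Int = 16

  // Write one value at the buffer's current position.
  def append(v: NanosValue, buf: ByteBuffer): Unit = {
    // Assumption: the sub-microsecond part is a count of nanoseconds in [0, 1000).
    require(v.nanosWithinMicro >= 0 && v.nanosWithinMicro < 1000,
      "nanosWithinMicro must be in [0, 1000)")
    buf.putLong(v.epochMicros)
    buf.putLong(v.nanosWithinMicro.toLong)
  }

  // Read one value from the buffer's current position.
  def extract(buf: ByteBuffer): NanosValue = {
    val micros = buf.getLong()
    val nanos = buf.getLong().toInt
    NanosValue(micros, nanos)
  }
}

object NanosPayloadDemo {
  // Append every value, then read them back in the same order.
  def roundTrip(values: Seq[NanosValue]): Seq[NanosValue] = {
    val buf = ByteBuffer.allocate(values.length * NanosPayload.Size)
    values.foreach(v => NanosPayload.append(v, buf))
    buf.flip()
    Vector.fill(values.length)(NanosPayload.extract(buf))
  }
}
```

   Because both fields are written as plain 8-byte words, the same pair of functions would serve the NTZ and LTZ variants alike; in this sketch only the caller would decide how to interpret `epochMicros`.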
   
   ### Why are the changes needed?
   
   Without this change the cache does not support nanosecond-precision timestamp 
types, so caching a DataFrame that contains such a column fails when the cache 
is materialized. With this change such DataFrames cache and read back correctly, 
consistent with the microsecond `TIMESTAMP_NTZ` / `TIMESTAMP` types, which the 
cache already supports.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Previously, caching a DataFrame containing a `TIMESTAMP_NTZ(p)` / 
`TIMESTAMP_LTZ(p)` column with `p` in the nanosecond range threw `not support 
type`. Now it caches and reads back the values, including sub-microsecond 
precision.
   
   ### How was this patch tested?
   
   - `ColumnTypeSuite`: append/extract round-trip for `TIMESTAMP_NTZ_NANOS` and 
`TIMESTAMP_LTZ_NANOS` (random values), plus `defaultSize` checks.
   - `InMemoryColumnarQuerySuite`: an end-to-end cache roundtrip for both nanos 
types, with the vectorized reader both on and off, covering sub-microsecond 
precision and null values.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


