Max Gekk created SPARK-56981:
--------------------------------
Summary: Add physical representation and UnsafeRow support for
nanosecond-capable timestamp types
Key: SPARK-56981
URL: https://issues.apache.org/jira/browse/SPARK-56981
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.2.0
Reporter: Max Gekk
Assignee: Max Gekk
h3. Summary
[PR #55952|https://github.com/apache/spark/pull/55952] / SPARK-56876 added
_logical_ types {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}}
(p ∈ [7, 9]) and JSON metadata. They still map to {{UninitializedPhysicalType}}
in {{PhysicalDataType.apply}}, so the engine cannot store or access values in
{{InternalRow}} / {{UnsafeRow}}.
This issue delivers the _minimum_ physical layer aligned with the merged SPIP
model: *epoch microseconds (8 bytes) + nanoseconds within the microsecond
(0–999, 2 bytes)*, 10 bytes in total, matching {{defaultSize = 10}} on the
logical types. A single shared unsafe value representation at the row layer
serves both the NTZ and LTZ nanos types; the semantic differences stay in the
logical/SQL layers.
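The split into the two fields can be sketched in plain Java. This is only an illustration of the arithmetic, assuming the input is a signed epoch-nanosecond count; the class and method names are hypothetical and are not part of Spark. Floor division keeps the nanosecond remainder in [0, 999] for instants before the epoch as well.

```java
// Hypothetical helper for illustration only; not a Spark API.
final class NanosSplit {
    static final long NANOS_PER_MICRO = 1_000L;

    /** Epoch microseconds, rounded toward negative infinity. */
    static long epochMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    /** Nanoseconds within the microsecond, always in [0, 999]. */
    static short nanosInMicro(long epochNanos) {
        return (short) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    /** Recombines the two fields; throws ArithmeticException on long overflow. */
    static long toEpochNanos(long micros, short nanos) {
        return Math.addExact(Math.multiplyExact(micros, NANOS_PER_MICRO), nanos);
    }
}
```

For example, an epoch-nanosecond value of -1 splits into micros = -1 and nanos = 999, which recombine to -1.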
This is the *unblocker* for downstream work (cast, Parquet, expressions). It is
intentionally small: no SQL parser, no SQLConf preview, no casts, no Parquet,
no {{TypeOps}} / Types Framework requirement.
_Ordering / compare / hash_ for these types is *out of scope* and will be
tracked in a separate follow-up issue.
h3. What to do
*common/unsafe*
* Add {{org.apache.spark.unsafe.types.TimestampNTZNanos}} (name as
implemented): immutable value with {{long}} epoch micros + {{short}}
nanos-in-micro ∈ [0, 999]; {{equals}} / {{hashCode}}.
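A minimal sketch of such a value class follows. The class name {{NanosTimestampValue}}, its field names, and the choice to validate in the constructor are assumptions made for illustration; the class actually added under {{org.apache.spark.unsafe.types}} may differ.

```java
import java.io.Serializable;

// Hypothetical stand-in for the unsafe value class; names are illustrative.
final class NanosTimestampValue implements Serializable {
    final long epochMicros;        // microseconds since the epoch
    final short nanosWithinMicro;  // extra nanoseconds, in [0, 999]

    NanosTimestampValue(long epochMicros, short nanosWithinMicro) {
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException(
                "nanosWithinMicro must be in [0, 999]: " + nanosWithinMicro);
        }
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NanosTimestampValue)) return false;
        NanosTimestampValue that = (NanosTimestampValue) o;
        return epochMicros == that.epochMicros
            && nanosWithinMicro == that.nanosWithinMicro;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(epochMicros) + Short.hashCode(nanosWithinMicro);
    }
}
```

Ordering is deliberately absent here, since compare semantics are out of scope for this issue.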
*PhysicalDataType*
* Add {{PhysicalTimestampNanosType}} with {{InternalType}} = the unsafe value
class.
* Register {{TimestampNTZNanosType}} and {{TimestampLTZNanosType}} in
{{PhysicalDataType.applyDefault}} (no {{UninitializedPhysicalType}}
fall-through).
*InternalRow*
* Add get/set accessors on {{GenericInternalRow}} (and wiring in
{{InternalRow}} accessor dispatch) for the new physical type.
*UnsafeRow*
* Store values using the same pattern as {{PhysicalCalendarIntervalType}}
(non-fixed field: pointer in the 8-byte word + fixed payload), since the 10
logical bytes do not fit in a single 8-byte word.
* Implement read and write on {{UnsafeRow}}; update {{UnsafeRow.isFixedLength}}
/ size estimation if needed.
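The pointer-plus-payload pattern can be sketched outside Spark with a plain {{ByteBuffer}}. The sketch assumes the slot word packs the payload offset in its upper 32 bits and the payload size in its lower 32 bits, and that the payload is 8 bytes of micros followed by 2 bytes of nanos. The class name, method names, default byte order, and the lack of padding are illustrative assumptions, not the final {{UnsafeRow}} layout.

```java
import java.nio.ByteBuffer;

// Illustrative model of a non-fixed field; not the UnsafeRow implementation.
final class NanosRowLayoutSketch {
    static final int PAYLOAD_BYTES = 10; // 8 (micros) + 2 (nanos); a real row may pad this

    /** Packs the payload offset (upper 32 bits) and size (lower 32 bits). */
    static long packOffsetAndSize(int offset, int size) {
        return ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    static int offsetOf(long word) { return (int) (word >>> 32); }

    static int sizeOf(long word) { return (int) word; }

    /** Writes the payload, then points the field's 8-byte slot at it. */
    static void write(ByteBuffer row, int slotPos, int payloadOffset,
                      long micros, short nanos) {
        row.putLong(payloadOffset, micros);
        row.putShort(payloadOffset + 8, nanos);
        row.putLong(slotPos, packOffsetAndSize(payloadOffset, PAYLOAD_BYTES));
    }

    /** Follows the slot's pointer and reads the microsecond field. */
    static long readMicros(ByteBuffer row, int slotPos) {
        return row.getLong(offsetOf(row.getLong(slotPos)));
    }

    /** Follows the slot's pointer and reads the nanosecond remainder. */
    static short readNanos(ByteBuffer row, int slotPos) {
        return row.getShort(offsetOf(row.getLong(slotPos)) + 8);
    }
}
```

In this model the slot never holds the value itself, so the same accessor logic serves both the NTZ and LTZ logical types.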
*Codegen / getters*
* {{SpecializedGettersReader}} and {{CodeGenerator}} read path for
{{PhysicalTimestampNanosType}}; write path included if required for roundtrip
tests or projection writers.
*Literals*
* Extend {{Literal}} validation in {{literals.scala}} to accept the unsafe
value type for nanos timestamp physical type.
h3. Tests
* {{DataTypeSuite}}: {{PhysicalDataType(TimestampNTZNanosType(p))}} and the LTZ
variant are not {{UninitializedPhysicalType}}; {{defaultSize}} remains 10.
* New or extended suite: {{InternalRow}} set/get roundtrip for non-null and
null.
* {{UnsafeRow}} write/read roundtrip for a struct with nanos timestamp
column(s).
* Regression: microsecond {{TimestampType}} / {{TimestampNTZType}} unchanged.
h3. Acceptance criteria
* {{PhysicalDataType.apply}} returns a concrete physical type for
{{TimestampNTZNanosType}} and {{TimestampLTZNanosType}} for all valid p ∈ [7,
9].
* Values can be written to and read from {{UnsafeRow}} and
{{GenericInternalRow}} without falling through to {{UninitializedPhysicalType}}
or hitting generic unsupported-physical-type failures in tests.
* Codegen and interpreted getters can read a bound column of this physical type
in a minimal projection test.
* No change to behavior of {{TimestampType}}, {{TimestampNTZType}}, or existing
microsecond storage.
* Downstream issues (parser, SQLConf, cast, Parquet) can depend on this issue
and assume the SPIP composite row layout.
h3. References
* Precedent: {{PhysicalCalendarIntervalType}} + {{CalendarInterval}} unsafe type