Max Gekk created SPARK-57101:
--------------------------------
Summary: Register nanosecond timestamp types in the Types
Framework (server-side)
Key: SPARK-57101
URL: https://issues.apache.org/jira/browse/SPARK-57101
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h3. Summary
Register TimestampNTZNanosType(p) and TimestampLTZNanosType(p) (p in [7, 9]) in
the Spark SQL Types Framework (SPARK-53504) for server-side (catalyst)
operations. Logical types and the physical row layer already exist
(SPARK-56876, SPARK-56981); today these types are wired only through legacy
dispatch in PhysicalDataType, Literal, InternalRow, and codegen. This issue
centralizes that wiring behind TypeOps when spark.sql.types.framework.enabled
is true.
This issue covers physical representation, literals, row accessors, and codegen
class selection only. java.time conversion, Dataset encoders, Connect proto,
Arrow, and cast formatting are out of scope and will be handled in follow-up
issues after SPARK-57033 and related work land.
h3. Background
* Parent SPIP: SPARK-56822 (Timestamps with nanosecond precision)
* Types Framework: SPARK-53504; reference implementation is TimeTypeOps /
TimeTypeApiOps
* Merged foundation:
** SPARK-56876 — logical types TimestampNTZNanosType / TimestampLTZNanosType
** SPARK-56981 — physical value TimestampNanosVal,
PhysicalTimestampNTZNanosType / PhysicalTimestampLTZNanosType, InternalRow and
UnsafeRow accessors (PR #56059)
* Internal representation: epochMicros (long) + nanosWithinMicro (short,
0–999), stored as TimestampNanosVal in rows
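A minimal sketch of the two-field representation described above, assuming the input is a count of nanoseconds since the epoch held in a Long. NanosParts, fromEpochNanos, and toEpochNanos are hypothetical names used for illustration only; they are not the API of the real TimestampNanosVal.

```scala
// Hypothetical stand-in for TimestampNanosVal; the real class lives in
// org.apache.spark.unsafe.types and its API is not reproduced here.
final case class NanosParts(epochMicros: Long, nanosWithinMicro: Short) {
  require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999,
    s"nanosWithinMicro out of range: $nanosWithinMicro")
}

object NanosParts {
  private val NanosPerMicro = 1000L

  // Split nanoseconds since the epoch into the two stored fields.
  // floorDiv/floorMod keep the remainder in [0, 999] for pre-epoch values.
  def fromEpochNanos(epochNanos: Long): NanosParts =
    NanosParts(
      Math.floorDiv(epochNanos, NanosPerMicro),
      Math.floorMod(epochNanos, NanosPerMicro).toShort)

  // Recombine the fields; throws ArithmeticException on Long overflow.
  def toEpochNanos(p: NanosParts): Long =
    Math.addExact(
      Math.multiplyExact(p.epochMicros, NanosPerMicro),
      p.nanosWithinMicro.toLong)
}
```

Floor division is used here rather than truncating division so that a pre-epoch instant such as -1 ns maps to epochMicros = -1 and nanosWithinMicro = 999, which keeps the second field inside the 0–999 range stated above.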
h3. What to do
*Add TypeOps implementations (sql/catalyst)*
* Create TimestampNTZNanosTypeOps and TimestampLTZNanosTypeOps, with a shared
base for common logic, following the TimeTypeOps pattern.
* Register both in TypeOps.apply() — single registration point alongside
TimeType.
*Implement TypeOps methods using the existing SPARK-56981 behavior:*
|| Method || Behavior ||
| getPhysicalType | PhysicalTimestampNTZNanosType or PhysicalTimestampLTZNanosType |
| getJavaClass | classOf[TimestampNanosVal] |
| getRowWriter | setTimestampNTZNanos / setTimestampLTZNanos on InternalRow |
| getDefaultLiteral | Literal.create(TimestampNanosVal.ZERO, type) |
| getJavaLiteral | Java literal for codegen (e.g. TimestampNanosVal.ZERO or fromParts) |
| getMutableValue | Mutable holder for TimestampNanosVal in SpecificInternalRow (new MutableTimestampNanos or equivalent; avoid unnecessary MutableAny fallback) |
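The shared-base layout suggested above could look roughly like the sketch below. It is self-contained and does not compile against Spark: every class in it is a stand-in, SketchTypeOps is a hypothetical reduced contract covering only three of the six methods in the table, and the signatures are assumptions rather than the real TypeOps API.

```scala
object TypeOpsSketch {
  // Stand-ins for Spark classes. Names mirror the issue text; bodies and
  // signatures are illustrative assumptions, not the real API.
  final case class TimestampNanosVal(epochMicros: Long, nanosWithinMicro: Short)
  object TimestampNanosVal {
    val ZERO: TimestampNanosVal = TimestampNanosVal(0L, 0.toShort)
  }

  sealed trait DataType
  final case class TimestampNTZNanosType(precision: Int) extends DataType
  final case class TimestampLTZNanosType(precision: Int) extends DataType

  sealed trait PhysicalDataType
  case object PhysicalTimestampNTZNanosType extends PhysicalDataType
  case object PhysicalTimestampLTZNanosType extends PhysicalDataType

  final case class Literal(value: Any, dataType: DataType)

  // Hypothetical, reduced version of the TypeOps contract.
  trait SketchTypeOps {
    def dataType: DataType
    def getPhysicalType: PhysicalDataType
    def getJavaClass: Class[_]
    def getDefaultLiteral: Literal
  }

  // Shared base: everything that does not depend on NTZ vs LTZ.
  abstract class NanosTypeOpsBase(val dataType: DataType) extends SketchTypeOps {
    def getJavaClass: Class[_] = classOf[TimestampNanosVal]
    def getDefaultLiteral: Literal = Literal(TimestampNanosVal.ZERO, dataType)
  }

  // Only the physical type differs between the two variants.
  final class NTZNanosOps(t: TimestampNTZNanosType) extends NanosTypeOpsBase(t) {
    def getPhysicalType: PhysicalDataType = PhysicalTimestampNTZNanosType
  }

  final class LTZNanosOps(t: TimestampLTZNanosType) extends NanosTypeOpsBase(t) {
    def getPhysicalType: PhysicalDataType = PhysicalTimestampLTZNanosType
  }
}
```

In this sketch the Java class and default literal sit in the base because both variants share TimestampNanosVal; getRowWriter, getJavaLiteral, and getMutableValue are omitted because they depend on InternalRow and codegen details not reproduced here.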
*Add minimal TypeApiOps stubs (sql/api)*
* Create TimestampNTZNanosTypeApiOps and TimestampLTZNanosTypeApiOps registered
in TypeApiOps.apply().
* TimestampNTZNanosTypeOps / TimestampLTZNanosTypeOps extend the corresponding
ApiOps class and TypeOps (same pattern as TimeTypeOps extends TimeTypeApiOps).
* format / formatUTF8 / toSQLValue: interim implementation acceptable (e.g.
epoch-micros-based display or TimestampNanosVal.toString) until dedicated FSP
formatters exist in a follow-up issue.
* getEncoder: not in scope for this issue.
*Integration points (automatic when TypeOps returns Some)*
These call sites already delegate to TypeOps(dt).map(...).getOrElse(legacy); no
per-call-site edits should be required beyond registration:
* PhysicalDataType.apply
* Literal.default
* InternalRow.getWriter
* CodeGenerator / EncoderUtils: Java class selection for codegen
* SpecificInternalRow mutable column values
*Feature flag*
* All registration is gated by spark.sql.types.framework.enabled (same as
TimeType).
* When the flag is false, behavior must remain identical to current legacy
paths.
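The delegation pattern and flag gating described above can be sketched as follows. This is an illustration under stated assumptions, not Spark code: Conf is a hypothetical holder standing in for spark.sql.types.framework.enabled, SketchOps and the string-valued physical type names are simplifications, and passing the conf explicitly is a choice made only to keep the example self-contained.

```scala
object DispatchSketch {
  // Illustrative stand-ins; not the real Spark classes.
  sealed trait DataType
  final case class TimestampNTZNanosType(precision: Int) extends DataType
  case object IntegerType extends DataType

  // Hypothetical flag holder for spark.sql.types.framework.enabled.
  final case class Conf(frameworkEnabled: Boolean)

  trait SketchOps { def physicalTypeName: String }

  object SketchOps {
    // Single registration point: returns None when the flag is off or the
    // type is not registered, so callers fall back to the legacy path.
    def apply(dt: DataType, conf: Conf): Option[SketchOps] =
      if (!conf.frameworkEnabled) None
      else dt match {
        case _: TimestampNTZNanosType =>
          Some(new SketchOps {
            val physicalTypeName = "PhysicalTimestampNTZNanosType"
          })
        case _ => None
      }
  }

  // Legacy dispatch, kept unchanged so flag-off behavior is preserved.
  def legacyPhysicalTypeName(dt: DataType): String = dt match {
    case _: TimestampNTZNanosType => "PhysicalTimestampNTZNanosType"
    case IntegerType => "PhysicalIntegerType"
  }

  // Call-site shape: TypeOps(dt).map(...).getOrElse(legacy).
  def physicalTypeName(dt: DataType, conf: Conf): String =
    SketchOps(dt, conf).map(_.physicalTypeName)
      .getOrElse(legacyPhysicalTypeName(dt))
}
```

Because the call site already has the map/getOrElse shape, adding a case to the registration point is the only change needed for a new type, and comparing the result with the flag on and off gives a direct equivalence check.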
h3. Tests
* With spark.sql.types.framework.enabled=true:
** PhysicalDataType(TimestampNTZNanosType(9)) and LTZ variant return the
correct physical types (not UninitializedPhysicalType).
** Literal.default matches TimestampNanosVal.ZERO for both nanos types.
** InternalRow.getWriter roundtrip: set and read via accessor for NTZ and LTZ.
** SpecificInternalRow update/read for nanos columns.
* With the flag false: regression tests confirm no behavior change relative to
the legacy paths on master.
* Framework-on vs framework-off equivalence tests for the operations above.
h3. Acceptance criteria
* TypeOps(TimestampNTZNanosType(p)) and TypeOps(TimestampLTZNanosType(p))
return non-empty ops when spark.sql.types.framework.enabled=true, for p in {7,
8, 9}.
* Listed integration points use TypeOps implementations and match legacy
behavior.
* spark.sql.types.framework.enabled=false preserves current behavior.
* No change to UnsafeRow layout, TimestampNanosRowValues, or microsecond
TimestampType / TimestampNTZType behavior.
h3. Out of scope
* CatalystTypeConverters and java.time roundtrip (SPARK-57033)
* SerializerBuildHelper / DeserializerBuildHelper and RowEncoder encoders
* ConnectTypeOps and Connect proto literals
* Arrow type mapping and ArrowFieldWriter
* PySpark conversion (EvaluatePython)
* Cast matrix, Parquet read/write, ColumnVector / vectorized Parquet
* Physical ordering, compare, and hash for nanos types
* Removing legacy branches from PhysicalDataType.applyDefault (optional cleanup
in a later issue)
h3. Depends on
* SPARK-56981 (physical row layer and TimestampNanosVal)
h3. References
* SPARK-56822 — parent SPIP
* SPARK-53504 — Types Framework
* Precedent: org.apache.spark.sql.catalyst.types.ops.TimeTypeOps
* Physical value: org.apache.spark.unsafe.types.TimestampNanosVal