Max Gekk created SPARK-57103:
--------------------------------
Summary: Add ordering, compare, and hash for nanosecond timestamp types
Key: SPARK-57103
URL: https://issues.apache.org/jira/browse/SPARK-57103
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h3. Summary
SPARK-56981 added physical storage for TimestampNTZNanosType(p) and
TimestampLTZNanosType(p) (p in [7, 9]) as TimestampNanosVal (epochMicros +
nanosWithinMicro). Values can be written to and read from InternalRow / UnsafeRow,
but ordering, comparison, and hashing are not implemented:
PhysicalTimestampNTZNanosType and PhysicalTimestampLTZNanosType throw on
ordering, and hash expressions do not handle the composite value.
This issue adds compare, PhysicalDataType.ordering, and hash support so queries
using ORDER BY, sort, join keys, GROUP BY, DISTINCT, BETWEEN, and hash() /
xxhash64() on nanosecond timestamp columns work.
h3. Background
* Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
* Depends on: SPARK-56981 (TimestampNanosVal, row accessors)
* Logical types: SPARK-56876
* TimestampNanosVal already implements equals and hashCode (manual mix on
epochMicros and nanosWithinMicro); compareTo / Ordering is missing.
* PhysicalDataType.ordering for PhysicalTimestampNTZNanosType /
PhysicalTimestampLTZNanosType currently throws
orderedOperationUnsupportedByDataTypeError (deferred in SPARK-56981).
* hash.scala handles TimestampType and TimestampNTZType as microsecond long
values; there is no branch for nanos composite values.
Comparison semantics: a total order on the pair (epochMicros, nanosWithinMicro),
compared field by field on the proleptic-Gregorian epoch-micro timeline. NTZ and
LTZ share the same pair layout (the zone affects interpretation elsewhere, not
the stored pair). NTZ and LTZ columns are not mutually comparable unless
explicit cast rules say otherwise (out of scope here).
h3. What to do
h4. 1. Compare on TimestampNanosVal
* Add compareTo (or a shared Ordering[TimestampNanosVal]) that orders by
epochMicros, then nanosWithinMicro.
* Handle nulls via existing Catalyst null ordering, not inside compareTo.
* Stay consistent with equals: for normalized values, compare(a, b) == 0 must
hold exactly when a.equals(b).
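A minimal sketch of the ordering described in this step, in plain Java. NanosVal
is a hypothetical stand-in that carries only the two fields named in this issue;
it is not the real TimestampNanosVal, and the int field type, the [0, 999] range,
and the NanosValOrdering name are assumptions made for illustration.

```java
import java.util.Comparator;

// Hypothetical stand-in for TimestampNanosVal with the two fields named in
// this issue; the real class may use different field types and accessors.
final class NanosVal {
  final long epochMicros;
  final int nanosWithinMicro; // assumed normalized to [0, 999]

  NanosVal(long epochMicros, int nanosWithinMicro) {
    this.epochMicros = epochMicros;
    this.nanosWithinMicro = nanosWithinMicro;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof NanosVal)) return false;
    NanosVal other = (NanosVal) o;
    return epochMicros == other.epochMicros
        && nanosWithinMicro == other.nanosWithinMicro;
  }

  @Override
  public int hashCode() {
    return 31 * Long.hashCode(epochMicros) + nanosWithinMicro;
  }
}

final class NanosValOrdering {
  // Orders by epochMicros, then by nanosWithinMicro. Both keys are compared
  // with Long.compare / Integer.compare semantics, never by subtraction.
  // Nulls are deliberately not handled here.
  static final Comparator<NanosVal> INSTANCE =
      Comparator.comparingLong((NanosVal v) -> v.epochMicros)
          .thenComparingInt(v -> v.nanosWithinMicro);
}
```

With this shape, compare returns 0 only when both fields match, which is the
same condition equals checks, so the two stay consistent for normalized values.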
h4. 2. PhysicalDataType.ordering
* Implement ordering on PhysicalTimestampNTZNanosType and
PhysicalTimestampLTZNanosType returning Ordering[TimestampNanosVal] (or
Ordering[Any] as other physical types do).
* Remove orderedOperationUnsupportedByDataTypeError from these physical types.
* Update the scaladoc that says ordering was deferred.
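The following plain-Java sketch illustrates the untyped flavor of the ordering
(the Ordering[Any] option above) and keeps null placement outside the value
comparison, as step 1 asks. NanosVal, PhysicalOrderingSketch, and sortNullsFirst
are hypothetical names; nothing here is Spark API, and the real change would be
written in Scala inside PhysicalDataType.scala.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for TimestampNanosVal (fields as named in this issue).
final class NanosVal {
  final long epochMicros;
  final int nanosWithinMicro; // assumed normalized to [0, 999]

  NanosVal(long epochMicros, int nanosWithinMicro) {
    this.epochMicros = epochMicros;
    this.nanosWithinMicro = nanosWithinMicro;
  }
}

final class PhysicalOrderingSketch {
  // Untyped ordering: callers pass values as Object and the ordering casts
  // them back. It assumes non-null inputs.
  static final Comparator<Object> UNTYPED = (x, y) -> {
    NanosVal a = (NanosVal) x;
    NanosVal b = (NanosVal) y;
    int c = Long.compare(a.epochMicros, b.epochMicros);
    return c != 0 ? c : Integer.compare(a.nanosWithinMicro, b.nanosWithinMicro);
  };

  // Null placement is layered on by the caller (here: nulls first, ascending),
  // so the value ordering itself never sees a null.
  static List<Object> sortNullsFirst(List<Object> values) {
    List<Object> out = new ArrayList<>(values);
    out.sort(Comparator.nullsFirst(UNTYPED));
    return out;
  }
}
```

Keeping the null policy in a wrapper means the same value ordering can serve
both NULLS FIRST and NULLS LAST sorts without modification.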
h4. 3. Hash expressions (interpreted + codegen)
* Extend hash.scala (and related codegen paths) for TimestampNTZNanosType and
TimestampLTZNanosType.
* Hash the composite consistently with TimestampNanosVal.hashCode (epochMicros
and nanosWithinMicro); follow the pattern used for CalendarInterval or other
multi-field physical types where applicable.
* Cover hash and xxhash64 (and murmur3 if other timestamp types do).
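One way to hash a two-field value is to chain the fields through the seed: hash
the first field, then use that result as the seed for the second. The sketch
below writes out the 32-bit Murmur3 mixing steps so it is self-contained. The
class name NanosHashSketch, the field order (epochMicros first), and the int
type for nanosWithinMicro are assumptions; the sketch does not claim to
reproduce the values Spark's hash() would return.

```java
final class NanosHashSketch {
  // 32-bit Murmur3 mixing steps, written out so the example has no
  // dependencies. This is an illustration, not Spark's own hasher class.
  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  static int hashInt(int input, int seed) {
    return fmix(mixH1(seed, mixK1(input)), 4);
  }

  static int hashLong(long input, int seed) {
    int low = (int) input;
    int high = (int) (input >>> 32);
    int h1 = mixH1(seed, mixK1(low));
    h1 = mixH1(h1, mixK1(high));
    return fmix(h1, 8);
  }

  // Chains the fields: the hash of epochMicros becomes the seed used to hash
  // nanosWithinMicro. The field order is an assumption of this sketch.
  static int hashNanos(long epochMicros, int nanosWithinMicro, int seed) {
    return hashInt(nanosWithinMicro, hashLong(epochMicros, seed));
  }
}
```

Because the result depends only on the two fields and the seed, two values that
are equal always hash to the same number, which is the property GROUP BY,
DISTINCT, and hash joins rely on.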
h4. 4. Codegen comparison
* Ensure the CodeGenerator.genComp / ordering paths (for AtomicType or
physical-type-specific branches) can compare nanos timestamp columns. These
paths may already route through PhysicalDataType.ordering once it is
implemented; verify both the whole-stage codegen and the interpreted paths.
h4. 5. Tests
* Unit: compareTo / Ordering on TimestampNanosVal (including negatives, equal
epochMicros different nanosWithinMicro, Long.MinValue / Long.MaxValue
epochMicros).
* SQL: ORDER BY asc/desc on nanos NTZ and LTZ columns.
* SQL: join on nanos timestamp key (equi-join).
* SQL: GROUP BY and DISTINCT on nanos column.
* SQL: hash(expr) and xxhash64(expr) stable and consistent with equals.
* Regression: microsecond TimestampType / TimestampNTZType behavior unchanged.
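The Long.MinValue / Long.MaxValue cases in the unit-test bullet matter because a
subtraction-based comparison overflows silently at the extremes. A small
plain-Java illustration follows; the class and method names are hypothetical and
nothing in it is taken from Spark.

```java
final class CompareOverflowDemo {
  // Buggy on purpose: a - b overflows when the operands are far apart, so the
  // sign of the difference can be wrong.
  static int compareBySubtraction(long a, long b) {
    long d = a - b;
    return d < 0 ? -1 : (d > 0 ? 1 : 0);
  }

  // Safe: Long.compare never computes a difference.
  static int compareSafely(long a, long b) {
    return Long.compare(a, b);
  }
}
```

For example, compareBySubtraction(Long.MIN_VALUE, 1L) returns 1 because
Long.MIN_VALUE - 1 wraps around to Long.MAX_VALUE, while compareSafely returns a
negative number as expected.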
h3. Acceptance criteria
* ORDER BY on a column of TimestampNTZNanosType or TimestampLTZNanosType
succeeds and sorts by (epochMicros, nanosWithinMicro).
* Equi-join and GROUP BY / DISTINCT on nanos timestamp columns succeed in tests.
* hash() / xxhash64() on nanos timestamp values are deterministic and agree
with equals: values that are equal produce equal hashes.
* PhysicalDataType.ordering no longer throws for PhysicalTimestampNTZNanosType
/ PhysicalTimestampLTZNanosType.
* No change to comparison or hash behavior of existing microsecond timestamp
types.
h3. Out of scope
* Cast matrix, type coercion, Parquet read/write, string parsing, java.time
encoders
* Cross-type comparison (nanos LTZ vs micro LTZ, NTZ vs LTZ) except what the
existing analyzer already allows via casts
* Types Framework registration (SPARK-57101)
* ColumnVector / vectorized hash (can follow SPARK-57100 separately if needed)
* ANSI interval / timestamp subtraction at nanos precision
h3. Unblocks
* Mid-term SPIP goal: filters, joins, aggregations, and sort on nanosecond
timestamp columns
* Expression and benchmark work that assumes comparable, hashable keys
h3. References
* org.apache.spark.unsafe.types.TimestampNanosVal
* sql/catalyst/.../PhysicalDataType.scala (PhysicalTimestampNTZNanosType /
PhysicalTimestampLTZNanosType)
* sql/catalyst/.../expressions/hash.scala
* Precedent: TIME type hash support (SPARK-51664)