Max Gekk created SPARK-57103:
--------------------------------

             Summary: Add ordering, compare, and hash for nanosecond timestamp types
                 Key: SPARK-57103
                 URL: https://issues.apache.org/jira/browse/SPARK-57103
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h3. Summary

SPARK-56981 added physical storage for TimestampNTZNanosType(p) and 
TimestampLTZNanosType(p) (p in [7, 9]) as TimestampNanosVal (epochMicros + 
nanosWithinMicro). Values can be written and read from InternalRow / UnsafeRow, 
but ordering, comparison, and hashing are not implemented: 
PhysicalTimestampNTZNanosType and PhysicalTimestampLTZNanosType throw on 
ordering, and hash expressions do not handle the composite value.

This issue adds compare, PhysicalDataType.ordering, and hash support so queries 
using ORDER BY, sort, join keys, GROUP BY, DISTINCT, BETWEEN, and hash() / 
xxhash64() on nanosecond timestamp columns work.

h3. Background

* Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
* Depends on: SPARK-56981 (TimestampNanosVal, row accessors)
* Logical types: SPARK-56876
* TimestampNanosVal already implements equals and hashCode (manual mix on 
epochMicros and nanosWithinMicro); compareTo / Ordering is missing.
* PhysicalDataType.ordering for PhysicalTimestampNTZNanosType / 
PhysicalTimestampLTZNanosType currently throws 
orderedOperationUnsupportedByDataTypeError (deferred in SPARK-56981).
* hash.scala handles TimestampType and TimestampNTZType as microsecond long 
values; there is no branch for nanos composite values.

Comparison semantics: a total order on the pair (epochMicros, 
nanosWithinMicro) on the proleptic-Gregorian epoch-micro timeline. NTZ and LTZ 
use the same pair layout (the zone affects interpretation elsewhere, not the 
stored pair). NTZ and LTZ columns are not mutually comparable unless explicit 
cast rules say otherwise (out of scope here).
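
To illustrate why ordering by the pair matches the timeline, here is a minimal 
Java sketch. It assumes nanosWithinMicro is normalized to [0, 999] with floor 
semantics for pre-epoch values; that normalization, the class name 
NanosPairSketch, and both helpers are hypothetical and not taken from 
SPARK-56981.

```java
// Hypothetical sketch; not the actual TimestampNanosVal code.
final class NanosPairSketch {
    static final long NANOS_PER_MICRO = 1_000L;

    // Split nanoseconds since the epoch into {epochMicros, nanosWithinMicro}.
    // Assumption: floor semantics, so nanosWithinMicro is always in [0, 999].
    static long[] split(long epochNanos) {
        long epochMicros = Math.floorDiv(epochNanos, NANOS_PER_MICRO);
        long nanosWithinMicro = Math.floorMod(epochNanos, NANOS_PER_MICRO);
        return new long[] {epochMicros, nanosWithinMicro};
    }

    // Lexicographic order on the pair: epochMicros first, then nanosWithinMicro.
    static int comparePairs(long[] a, long[] b) {
        int c = Long.compare(a[0], b[0]);
        return c != 0 ? c : Long.compare(a[1], b[1]);
    }
}
```

Under that assumption, one nanosecond before the epoch splits into (-1, 999), 
which sorts before (0, 0), so the pair order agrees with the nanosecond 
timeline on both sides of the epoch.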

h3. What to do

h4. 1. Compare on TimestampNanosVal

* Add compareTo (or a shared Ordering[TimestampNanosVal]) that orders by 
epochMicros, then nanosWithinMicro.
* Handle nulls via the existing Catalyst null ordering, not inside compareTo.
* Keep compareTo consistent with equals: for normalized values, compareTo 
returns 0 if and only if equals returns true.
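
A minimal Java sketch of the comparison described above. The class 
TimestampNanosValSketch is a hypothetical stand-in; its field types and 
hashCode formula are assumptions for illustration, not the real 
TimestampNanosVal.

```java
// Illustrative stand-in for TimestampNanosVal; field types and the hashCode
// formula are assumptions, not the actual Spark class.
final class TimestampNanosValSketch implements Comparable<TimestampNanosValSketch> {
    final long epochMicros;
    final int nanosWithinMicro; // assumed normalized to [0, 999]

    TimestampNanosValSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    @Override
    public int compareTo(TimestampNanosValSketch other) {
        // Order by epochMicros first, then by nanosWithinMicro.
        int c = Long.compare(epochMicros, other.epochMicros);
        return c != 0 ? c : Integer.compare(nanosWithinMicro, other.nanosWithinMicro);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimestampNanosValSketch)) return false;
        TimestampNanosValSketch that = (TimestampNanosValSketch) o;
        return epochMicros == that.epochMicros
            && nanosWithinMicro == that.nanosWithinMicro;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(epochMicros) + nanosWithinMicro;
    }
}
```

Because compareTo and equals look at the same two fields, compareTo returns 0 
exactly when equals returns true. Null handling is deliberately absent, as it 
belongs to the caller.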

h4. 2. PhysicalDataType.ordering

* Implement ordering on PhysicalTimestampNTZNanosType and 
PhysicalTimestampLTZNanosType returning Ordering[TimestampNanosVal] (or 
Ordering[Any] as other physical types do).
* Remove orderedOperationUnsupportedByDataTypeError from these physical types.
* Update the scaladoc that currently says ordering is deferred.

h4. 3. Hash expressions (interpreted + codegen)

* Extend hash.scala (and related codegen paths) for TimestampNTZNanosType and 
TimestampLTZNanosType.
* Hash the composite consistently with TimestampNanosVal.hashCode (epochMicros 
and nanosWithinMicro); follow the pattern used for CalendarInterval or other 
multi-field physical types where applicable.
* Cover hash and xxhash64 (and murmur3 if other timestamp types do).
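
One way to hash a two-field value is to chain a seeded hash, feeding the 
result for the first field in as the seed for the second. The Java sketch 
below only illustrates that chaining: hashLong is a stand-in 64-bit mixer, not 
Spark's Murmur3 or XXH64 code, and the class and method names are 
hypothetical.

```java
// Hypothetical sketch of seed chaining over a two-field value.
final class NanosHashSketch {
    // Stand-in seeded mixer (splitmix64-style finalizer); NOT Spark's hash.
    static long hashLong(long value, long seed) {
        long z = (seed ^ value) + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Hash epochMicros with the caller's seed, then use that result as the
    // seed when hashing nanosWithinMicro.
    static long hashNanos(long epochMicros, int nanosWithinMicro, long seed) {
        return hashLong(nanosWithinMicro, hashLong(epochMicros, seed));
    }
}
```

Since both fields feed the result, values that are equal under equals always 
hash equally, and the interpreted and codegen paths only need to agree on the 
field order.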

h4. 4. Codegen comparison

* Ensure that CodeGenerator.genComp / ordering paths for AtomicType or 
physical-type-specific branches can compare nanos timestamp columns. This may 
already route through PhysicalDataType.ordering once it is implemented; verify 
both the whole-stage codegen and interpreted paths.

h4. 5. Tests

* Unit: compareTo / Ordering on TimestampNanosVal (including negatives, equal 
epochMicros different nanosWithinMicro, Long.MinValue / Long.MaxValue 
epochMicros).
* SQL: ORDER BY asc/desc on nanos NTZ and LTZ columns.
* SQL: join on nanos timestamp key (equi-join).
* SQL: GROUP BY and DISTINCT on nanos column.
* SQL: hash(expr) and xxhash64(expr) are stable across runs and consistent 
with equals (equal values produce equal hashes).
* Regression: microsecond TimestampType / TimestampNTZType behavior unchanged.
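
The Long.MinValue / Long.MaxValue cases matter because a comparator built on 
subtraction overflows at the extremes. The following Java sketch shows the 
pitfall on plain long values; the class and method names are hypothetical.

```java
// Hypothetical sketch of why the extreme epochMicros cases need a test.
final class CompareOverflowSketch {
    // Buggy: a - b wraps around when the operands are far apart.
    static int compareBySubtraction(long a, long b) {
        return Long.signum(a - b);
    }

    // Correct: Long.compare never overflows.
    static int compareSafely(long a, long b) {
        return Long.compare(a, b);
    }
}
```

With a = Long.MIN_VALUE and b = Long.MAX_VALUE, the subtraction wraps around to 
1, so the buggy version reports a > b, while Long.compare reports a < b.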

h3. Acceptance criteria

* ORDER BY on a column of TimestampNTZNanosType or TimestampLTZNanosType 
succeeds and sorts by (epochMicros, nanosWithinMicro).
* Equi-join and GROUP BY / DISTINCT on nanos timestamp columns succeed in tests.
* hash() / xxhash64() on nanos timestamp values are deterministic and return 
equal results for values that are equal under equals.
* PhysicalDataType.ordering no longer throws for PhysicalTimestampNTZNanosType 
/ PhysicalTimestampLTZNanosType.
* No change to comparison or hash behavior of existing microsecond timestamp 
types.

h3. Out of scope

* Cast matrix, type coercion, Parquet read/write, string parsing, java.time 
encoders
* Cross-type comparison (nanos LTZ vs micro LTZ, NTZ vs LTZ) except what the 
existing analyzer already allows via casts
* Types Framework registration (SPARK-57101)
* ColumnVector / vectorized hash (can follow SPARK-57100 separately if needed)
* ANSI interval / timestamp subtraction at nanos precision

h3. Unblocks

* Mid-term SPIP goal: filters, joins, aggregations, and sort on nanosecond 
timestamp columns
* Expression and benchmark work that assumes comparable, hashable keys

h3. References

* org.apache.spark.unsafe.types.TimestampNanosVal
* sql/catalyst/.../PhysicalDataType.scala (PhysicalTimestampNTZNanosType / 
PhysicalTimestampLTZNanosType)
* sql/catalyst/.../expressions/hash.scala
* Precedent: TIME type hash support (SPARK-51664)


