Re: [PR] [SPARK-57103][SQL] Add hashing for nanosecond timestamp types [spark]

via GitHub Fri, 29 May 2026 12:22:55 -0700


MaxGekk commented on code in PR #56203:
URL: https://github.com/apache/spark/pull/56203#discussion_r3326454107



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
     checkEvaluation(HiveHash(Seq(time)), -1567775210)
   }
 
+  test("HashExpression supports nanosecond timestamp types") {
+    // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos, 
negative micros, and
+    // the Long epoch-micro boundaries.
+    val values = Seq(
+      TimestampNanosVal.fromParts(0L, 0.toShort),
+      TimestampNanosVal.fromParts(1L, 1.toShort),
+      TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+      TimestampNanosVal.fromParts(-1L, 500.toShort),
+      TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+      TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+    Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach 
{ lit =>
+        // checkEvaluation asserts the interpreted, codegen, and unsafe paths 
all agree.

Review Comment:
   Minor: with a `Literal` child, the unsafe projection only stores the scalar 
hash result, not the `TimestampNanosVal` input, so the unsafe *input* path 
isn't actually covered here. Either reword this comment or add the 
`BoundReference`-over-row case suggested above so the comment becomes accurate.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
     checkEvaluation(HiveHash(Seq(time)), -1567775210)
   }
 
+  test("HashExpression supports nanosecond timestamp types") {
+    // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos, 
negative micros, and
+    // the Long epoch-micro boundaries.
+    val values = Seq(
+      TimestampNanosVal.fromParts(0L, 0.toShort),
+      TimestampNanosVal.fromParts(1L, 1.toShort),
+      TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+      TimestampNanosVal.fromParts(-1L, 500.toShort),
+      TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+      TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+    Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach 
{ lit =>

Review Comment:
   All hash inputs here are `Literal`s, so the generated code embeds the 
`TimestampNanosVal` as a reference object and the ordinal-read codegen path 
(`CodeGenerator.getValue` -> `getTimestampNTZNanos`/`getTimestampLTZNanos`) is 
never exercised, and the value never round-trips through an `UnsafeRow` as a 
hash *input*. `checkEvaluationWithUnsafeProjection` here only projects the 
resulting `int`/`long` hash, not the nanos input.
   
   Since the motivation is hash-based GROUP BY / shuffle / joins (where the 
input is a `BoundReference` reading from a possibly-unsafe row), could we add a 
`BoundReference`-over-row case, e.g.:
   
   ```scala
   val row = InternalRow(TimestampNanosVal.fromParts(1234567890L, 999.toShort))
   val ref = BoundReference(0, dt, nullable = true)
   checkEvaluation(Murmur3Hash(Seq(ref), 42), Murmur3Hash(Seq(ref), 
42).eval(row), row)
   ```
   
   This drives the row read + unsafe round-trip that the literal-based tests 
skip.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
     checkEvaluation(HiveHash(Seq(time)), -1567775210)
   }
 
+  test("HashExpression supports nanosecond timestamp types") {
+    // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos, 
negative micros, and
+    // the Long epoch-micro boundaries.
+    val values = Seq(
+      TimestampNanosVal.fromParts(0L, 0.toShort),
+      TimestampNanosVal.fromParts(1L, 1.toShort),
+      TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+      TimestampNanosVal.fromParts(-1L, 500.toShort),
+      TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+      TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+    Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach 
{ lit =>
+        // checkEvaluation asserts the interpreted, codegen, and unsafe paths 
all agree.
+        checkEvaluation(Murmur3Hash(Seq(lit), 42), Murmur3Hash(Seq(lit), 
42).eval())

Review Comment:
   The `expected` value is `Murmur3Hash(Seq(lit), 42).eval()`, i.e. computed by 
the same expression under test, so this only proves the eval paths agree with 
each other (and, via the second test, that both fields contribute). A bug 
shared across all paths (e.g. a wrong constant, or a symmetric field swap) 
wouldn't be caught. The existing tests in this suite pin literals (e.g. 
`checkEvaluation(HiveHash(Seq(time)), -1567775210)`). Could we pin at least one 
golden constant per algorithm for a fixed `(epochMicros, nanosWithinMicro)` 
pair?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57103][SQL] Add hashing for nanosecond timestamp types [spark]

Reply via email to