MaxGekk commented on code in PR #56203:
URL: https://github.com/apache/spark/pull/56203#discussion_r3326454107
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with
ExpressionEvalHelper {
checkEvaluation(HiveHash(Seq(time)), -1567775210)
}
+ test("HashExpression supports nanosecond timestamp types") {
+ // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos,
negative micros, and
+ // the Long epoch-micro boundaries.
+ val values = Seq(
+ TimestampNanosVal.fromParts(0L, 0.toShort),
+ TimestampNanosVal.fromParts(1L, 1.toShort),
+ TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+ TimestampNanosVal.fromParts(-1L, 500.toShort),
+ TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+ TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+ Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+ TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+ (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach
{ lit =>
+ // checkEvaluation asserts the interpreted, codegen, and unsafe paths
all agree.
Review Comment:
Minor: with a `Literal` child, the unsafe projection only stores the scalar
hash result, not the `TimestampNanosVal` input, so the unsafe *input* path
isn't actually covered here. Either reword this comment or add the
`BoundReference`-over-row case suggested above so the comment becomes accurate.
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with
ExpressionEvalHelper {
checkEvaluation(HiveHash(Seq(time)), -1567775210)
}
+ test("HashExpression supports nanosecond timestamp types") {
+ // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos,
negative micros, and
+ // the Long epoch-micro boundaries.
+ val values = Seq(
+ TimestampNanosVal.fromParts(0L, 0.toShort),
+ TimestampNanosVal.fromParts(1L, 1.toShort),
+ TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+ TimestampNanosVal.fromParts(-1L, 500.toShort),
+ TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+ TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+ Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+ TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+ (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach
{ lit =>
Review Comment:
All hash inputs here are `Literal`s, so the generated code embeds the
`TimestampNanosVal` as a reference object and the ordinal-read codegen path
(`CodeGenerator.getValue` -> `getTimestampNTZNanos`/`getTimestampLTZNanos`) is
never exercised, and the value never round-trips through an `UnsafeRow` as a
hash *input*. `checkEvaluationWithUnsafeProjection` here only projects the
resulting `int`/`long` hash, not the nanos input.
Since the motivation is hash-based GROUP BY / shuffle / joins (where the
input is a `BoundReference` reading from a possibly-unsafe row), could we add a
`BoundReference`-over-row case, e.g.:
```scala
val row = InternalRow(TimestampNanosVal.fromParts(1234567890L, 999.toShort))
val ref = BoundReference(0, dt, nullable = true)
checkEvaluation(Murmur3Hash(Seq(ref), 42), Murmur3Hash(Seq(ref),
42).eval(row), row)
```
This drives the row read + unsafe round-trip that the literal-based tests
skip.
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala:
##########
@@ -885,6 +885,50 @@ class HashExpressionsSuite extends SparkFunSuite with
ExpressionEvalHelper {
checkEvaluation(HiveHash(Seq(time)), -1567775210)
}
+ test("HashExpression supports nanosecond timestamp types") {
+ // (epochMicros, nanosWithinMicro) pairs covering zero/mid/max nanos,
negative micros, and
+ // the Long epoch-micro boundaries.
+ val values = Seq(
+ TimestampNanosVal.fromParts(0L, 0.toShort),
+ TimestampNanosVal.fromParts(1L, 1.toShort),
+ TimestampNanosVal.fromParts(1234567890L, 999.toShort),
+ TimestampNanosVal.fromParts(-1L, 500.toShort),
+ TimestampNanosVal.fromParts(Long.MinValue, 0.toShort),
+ TimestampNanosVal.fromParts(Long.MaxValue, 999.toShort))
+
+ Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+ TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+ (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach
{ lit =>
+ // checkEvaluation asserts the interpreted, codegen, and unsafe paths
all agree.
+ checkEvaluation(Murmur3Hash(Seq(lit), 42), Murmur3Hash(Seq(lit),
42).eval())
Review Comment:
The `expected` value is `Murmur3Hash(Seq(lit), 42).eval()`, i.e. computed by
the same expression under test, so this only proves the eval paths agree with
each other (and, via the second test, that both fields contribute). A bug
shared across all paths (e.g. a wrong constant, or a symmetric field swap)
wouldn't be caught. The existing tests in this suite pin literals (e.g.
`checkEvaluation(HiveHash(Seq(time)), -1567775210)`). Could we pin at least one
golden constant per algorithm for a fixed `(epochMicros, nanosWithinMicro)`
pair?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]