Github user tejasapatil commented on a diff in the pull request:
https://github.com/apache/spark/pull/17056#discussion_r103384950
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---
@@ -371,6 +370,48 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
         new StructType().add("array", arrayOfString).add("map", mapOfString))
       .add("structOfUDT", structOfUDT))

+  test("hive-hash for decimal") {
+    def checkHiveHashForDecimal(
+        input: String,
+        precision: Int,
+        scale: Int,
+        expected: Long): Unit = {
+      val decimal = Decimal.apply(new java.math.BigDecimal(input))
+      decimal.changePrecision(precision, scale)
+      val decimalType = DataTypes.createDecimalType(precision, scale)
+      checkHiveHash(decimal, decimalType, expected)
+    }
+
+    checkHiveHashForDecimal("18", 38, 0, 558)
+    checkHiveHashForDecimal("-18", 38, 0, -558)
+    checkHiveHashForDecimal("-18", 38, 12, -558)
+    checkHiveHashForDecimal("18446744073709001000", 38, 19, -17070057)
--- End diff --
The main reason not all of them match is a difference in how scale and precision are enforced in Hive vs. Spark.

Hive does it using its own custom logic: https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/common/type/HiveDecimal.java#L274

Spark has its own way: https://github.com/apache/spark/blob/0e2405490f2056728d1353abbac6f3ea177ae533/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L230
Now, when one does `CAST(-18446744073709001000BD AS DECIMAL(38,19))`, the value does NOT fit in Hive's range, so Hive converts it to `null`... and `HASH()` over `null` returns 0.

In the case of Spark, `CAST(-18446744073709001000BD AS DECIMAL(38,19))` is valid, so running `HASH()` over it gives some non-zero result.
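To make this concrete, here is a rough sketch (not part of this PR) of what each engine does with that literal before any hashing happens. It assumes a spark-shell with Hive's branch-1.2 `hive-common` jar also on the classpath; the value names are just for illustration:

```scala
import java.math.BigDecimal

import org.apache.hadoop.hive.common.type.HiveDecimal
import org.apache.spark.sql.types.Decimal

val raw = new BigDecimal("-18446744073709001000")

// Hive side: enforcePrecisionScale() cannot fit the value into DECIMAL(38,19),
// so it hands back null, and HASH() over that null later evaluates to 0.
val hiveEnforced = HiveDecimal.enforcePrecisionScale(HiveDecimal.create(raw), 38, 19)
println(s"Hive enforcePrecisionScale -> $hiveEnforced")

// Spark side: mirroring what the new test does, Decimal.changePrecision() applies
// Spark's own rules; per the discussion above the value survives here, so hashing
// it yields a non-zero result.
val sparkDecimal = Decimal(raw)
val accepted = sparkDecimal.changePrecision(38, 19)
println(s"Spark changePrecision accepted=$accepted -> $sparkDecimal")
```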
TLDR: this difference arises before the hashing function comes into the picture. Making them consistent would mean matching the semantics of Decimal in Spark with those in Hive. I don't think it's a good idea to embark on that: it would be a breaking change, and this PR is not a strong enough reason to push for it. Hive-hash is best effort.