Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17056#discussion_r103384950
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---
    @@ -371,6 +370,48 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
             new StructType().add("array", arrayOfString).add("map", mapOfString))
           .add("structOfUDT", structOfUDT))
     
    +  test("hive-hash for decimal") {
    +    def checkHiveHashForDecimal(
    +        input: String,
    +        precision: Int,
    +        scale: Int,
    +        expected: Long): Unit = {
    +      val decimal = Decimal.apply(new java.math.BigDecimal(input))
    +      decimal.changePrecision(precision, scale)
    +      val decimalType = DataTypes.createDecimalType(precision, scale)
    +      checkHiveHash(decimal, decimalType, expected)
    +    }
    +
    +    checkHiveHashForDecimal("18", 38, 0, 558)
    +    checkHiveHashForDecimal("-18", 38, 0, -558)
    +    checkHiveHashForDecimal("-18", 38, 12, -558)
    +    checkHiveHashForDecimal("18446744073709001000", 38, 19, -17070057)
    --- End diff --
    
    The main reason why not all of them match is the difference in how scale and precision are enforced in Hive vs. Spark.
    
    Hive does it using its own custom logic: https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/common/type/HiveDecimal.java#L274
    
    Spark has its own way: https://github.com/apache/spark/blob/0e2405490f2056728d1353abbac6f3ea177ae533/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L230
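
    For illustration only (not something from this PR), here is a minimal sketch of the two enforcement paths, assuming Hive 1.2's `HiveDecimal.enforcePrecisionScale` and Spark's `Decimal.changePrecision` behave as the links above describe:

    ```scala
    import org.apache.hadoop.hive.common.type.HiveDecimal
    import org.apache.spark.sql.types.Decimal

    val input = new java.math.BigDecimal("-18446744073709001000")

    // Hive: enforcePrecisionScale returns null when the value does not fit
    // DECIMAL(38, 19), so everything downstream (including hash()) sees NULL.
    val hiveEnforced: HiveDecimal =
      HiveDecimal.enforcePrecisionScale(HiveDecimal.create(input), 38, 19)

    // Spark: changePrecision reports success/failure via a Boolean instead of
    // nulling the value out; the Decimal itself follows Spark's own rules.
    val sparkDecimal = Decimal(input)
    val fits: Boolean = sparkDecimal.changePrecision(38, 19)
    ```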
    
    Now, when one does `CAST(-18446744073709001000BD AS DECIMAL(38,19))` in Hive, the value does NOT fit within Hive's range, so Hive converts it to `null`... and `HASH()` over `null` returns 0.
    
    In Spark's case, `CAST(-18446744073709001000BD AS DECIMAL(38,19))` is valid, so running `HASH()` over it gives some non-zero result.
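
    As a rough sketch of that consequence (hypothetical usage, not taken from the PR), exercising the `HiveHash` expression directly with the decimal support from this PR applied:

    ```scala
    import org.apache.spark.sql.catalyst.expressions.{HiveHash, Literal}
    import org.apache.spark.sql.types.{DataTypes, Decimal}

    // Hive's side of the example above: the out-of-range value became NULL, and
    // hashing a single NULL input leaves the hash at the seed, i.e. 0.
    val nullHash = HiveHash(Seq(Literal.create(null, DataTypes.createDecimalType(38, 19)))).eval()

    // Spark's side for a value that does fit: a real (non-null) decimal is hashed,
    // giving a non-zero result (558 for "18" per the test above).
    val d = Decimal(new java.math.BigDecimal("18"))
    d.changePrecision(38, 0)
    val decimalHash = HiveHash(Seq(Literal.create(d, DataTypes.createDecimalType(38, 0)))).eval()
    ```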
     
    TL;DR: this difference arises before the hashing function comes into the picture. Making them consistent would mean matching the semantics of Decimal in Spark with those in Hive. I don't think it's a good idea to embark on that, as it would be a breaking change, and this PR alone is not a strong enough reason to push for it. Hive-hash is best effort.

