skambha opened a new pull request #27629: [SPARK-28067][SQL]Fix incorrect 
results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629
 
 
   ### What changes were proposed in this pull request?
   
   JIRA SPARK-28067:  Wrong results are returned for aggregate sum with 
decimals with whole stage codegen enabled
   
   **Repro:** 
   WholeStage codegen enabled -> wrong results
   WholeStage codegen disabled -> throws "Decimal precision 39 exceeds max precision 38"
   
   **Issues:** 
   1. Wrong results are returned, which is a correctness bug.
   2. Behavior is inconsistent between whole stage codegen enabled and disabled.
   
   **Cause:**
   Sum does not account for the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.
   
   **This PR makes the following changes:**
   - Throw an exception if there is a decimal overflow when computing the sum.
   - This is consistent with how Spark behaves when whole stage codegen is disabled.
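
   The idea behind the fix can be sketched without Spark. The names below (`SumOverflowSketch`, `checkOverflow`, `sumChecked`) are hypothetical, for illustration only; the actual change lives in Sum's updateExpressions/mergeExpressions in Catalyst.
   ```scala
   // A minimal, Spark-free sketch of checking for decimal overflow at every
   // intermediate step of a sum, assuming DecimalType's max precision of 38.
   object SumOverflowSketch {
     val MaxPrecision = 38 // max precision of Spark's DecimalType

     // Fail loudly, like Spark does with whole-stage codegen disabled,
     // instead of silently producing a wrong partial sum.
     def checkOverflow(d: BigDecimal, precision: Int, scale: Int): BigDecimal = {
       if (d.precision > precision) {
         throw new ArithmeticException(
           s"$d cannot be represented as Decimal($precision, $scale).")
       }
       d
     }

     // Apply the check after every partial addition (the analogue of
     // updateExpressions/mergeExpressions), so overflow cannot go unnoticed.
     def sumChecked(xs: Seq[BigDecimal]): BigDecimal =
       xs.foldLeft(BigDecimal(0).setScale(18)) { (acc, x) =>
         checkOverflow(acc + x, MaxPrecision, 18)
       }
   }
   ```
   With decimal(38,18), a value of 10^19 already uses all 38 digits, so summing ten or more such values pushes the intermediate total past the max precision and the check throws, mirroring the repro below.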
   
   **Pros:** 
   - No wrong results
   - Consistent behavior between whole stage codegen enabled and disabled
   - Other databases behave similarly, so there is precedent
   
   **Before fix:** (wrong results)
   ```
   scala> val df = Seq(
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
   df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
   
   scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
   df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
   
   scala> df2.show(40,false)
   +---------------------------------------+
   |sum(decNum)                            |
   +---------------------------------------+
   |20000000000000000000.000000000000000000|
   +---------------------------------------+
   ```
   
   **After fix:**
   ```
   scala> val df = Seq(
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
   df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
   
   scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
   df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
   
   scala> df2.show(40,false)
   20/02/18 13:36:19 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 9)
   java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18}) cannot be represented as Decimal(38, 18).
   ```
   
   ### Why are the changes needed?
   The changes are needed to fix the wrong results returned for decimal aggregate sum.
   
   ### Does this PR introduce any user-facing change?
   Prior to this fix, the user would see wrong results on an aggregate sum that involved decimal overflow; now the user will see an exception. This behavior is also consistent with how Spark behaves when whole stage codegen is disabled.
   
   ### How was this patch tested?
   A new test has been added, and the existing tests for the sql, catalyst and hive suites pass.
   
