skambha commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-596863783

I looked into why the spark.range(0,12,1,1) scenario works while the other scenario returns wrong results, as mentioned in https://github.com/apache/spark/pull/27627#issuecomment-595067541. (Minimal sketches of both scenarios, and of how to inspect the plans, are included at the end of this comment.)

**Case 1**, which used spark.range(0,12,1,1): this returns null. Please see this file for details: https://github.com/skambha/notes/blob/master/case1_works_notes.txt

There is only one partition and one WholeStageCodegen subtree, so there is no intermediate write via UnsafeRowWriter. All the values are added up into a sum that is an overflow value; the check-overflow code then kicks in, sees that it is an overflow value, and turns it into null. The null then gets written out.

1. There is 1 partition.
2. There is 1 WholeStageCodegen subtree.
3. The updateExpressions codegen does the add (+) for all 12 decimal values and _**does not**_ write the intermediate result out to UnsafeRow via UnsafeRowWriter. The sum amounts to `120000000000000000000.000000000000000000`, which is an overflowed value for decimal(38,18).
4. The sum then goes to the CheckOverflow code, which checks the decimal value against the target precision; it overflows, so this call evaluates to null.
```
/* 239 */ agg_value_1 = agg_mutableStateArray_0[0].toPrecision(
/* 240 */   38, 18, Decimal.ROUND_HALF_UP(), true);
```
5. The null is then written out via the UnsafeRowWriter.write call below.
```
/* 250 */ if (agg_isNull_1) {
/* 251 */   range_mutableStateArray_0[3].write(0, (Decimal) null, 38, 18);
```

This explains why this scenario works fine.

**Case 2**, which used spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)) and got wrong results. Please see this file for details: https://github.com/skambha/notes/blob/master/case2_wrongresults.txt

1. WholeStageCodegen has 4 subtrees.
2. The number of partitions is 2.
3. There are 11 rows in one partition and 1 row in the other.
4. For the partition with 11 rows, the values add up to `110000000000000000000.000000000000000000`, which does not fit in decimal(38,18), so UnsafeRowWriter writes it out as `null`.
5. For the partition with 1 row, the sum is `10000000000000000000.000000000000000000`, which does fit.
6. Next, the merge combines the two partial sums, a) null and b) `10000000000000000000`, using the coalesce and + expressions in mergeExpressions, and only then is that value checked for overflow. The merged value fits, so the query returns the wrong result. (Sidenote: if you split the range into 2 and 11 elements, you can see the result is `20000000000000000000.000000000000000000`.)

---

We already saw that when an overflowed decimal value is written out to UnsafeRow, it is written as a null value. So whether the wrong results reproduce or not comes down to:
- the combination of values in each partition: if the sum of the values in a partition becomes an overflow value **and** that sum ends up being written out via UnsafeRowWriter, then that partition's sum becomes **null**, **and**
- whether, in the subsequent merge phase, the sum of the per-partition values is an overflow value or not:
  - If it is an overflow value, then due to the overflow check (coming from evaluateExpression), the resultant value becomes null and is then written out.
  - If it is not an overflow value, the resultant value (which could be wrong) gets written out.
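
For reference, here is a minimal sketch of the two scenarios discussed above. This is not the exact repro from the linked comment; it assumes each row carries the value `10000000000000000000` cast to decimal(38,18) (consistent with the per-partition sums described above) and that it runs against a build without the fix in this PR.

```scala
import org.apache.spark.sql.functions._

// Hypothetical per-row value: 10000000000000000000 as decimal(38,18).
// 11 or 12 of these sum past what decimal(38,18) can hold; 1 or 2 of them fit.
val d = "cast('10000000000000000000' as decimal(38,18)) as d"

// Case 1: one partition, one WholeStageCodegen subtree. The 12 values are added
// without an intermediate UnsafeRow write; CheckOverflow sees the overflowed sum
// and the query returns null.
spark.range(0, 12, 1, 1).selectExpr(d).agg(sum("d")).show(false)

// Case 2: two partitions (1 row + 11 rows). The 11-row partition's partial sum
// overflows and is written out as null by UnsafeRowWriter; the merge phase
// coalesces the null away, the merged value fits in decimal(38,18), and a wrong
// (non-null) result is returned.
spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d")).show(false)

// Sidenote variant: with a 2-row + 11-row split, only the 11-row partial sum
// overflows to null, so the result is 20000000000000000000.000000000000000000.
spark.range(0, 2, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d")).show(false)
```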
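And a small sketch of how the plan details cited above (partition counts, WholeStageCodegen subtrees, and the generated code the snippets were taken from) can be inspected. I believe `debugCodegen()` from org.apache.spark.sql.execution.debug is the usual way to dump the generated code; `d` is the same hypothetical projection as in the previous sketch.

```scala
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions._

val d = "cast('10000000000000000000' as decimal(38,18)) as d"

val case1 = spark.range(0, 12, 1, 1).selectExpr(d).agg(sum("d"))
val case2 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d"))

// Input partition counts: 1 for case 1, 2 for case 2.
println(spark.range(0, 12, 1, 1).rdd.getNumPartitions)
println(spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)).rdd.getNumPartitions)

// Dumps the WholeStageCodegen subtrees and their generated Java code, which is
// where the toPrecision / UnsafeRowWriter.write lines quoted above come from.
case1.debugCodegen()
case2.debugCodegen()
```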
