skambha commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-596863783

I looked into why the spark.range(0,12,1,1) scenario works while the other scenario returns wrong results, as mentioned in https://github.com/apache/spark/pull/27627#issuecomment-595067541. (Minimal sketches of both scenarios, and of how to inspect the plans, are included at the end of this comment.)

**Case 1**, which used spark.range(0,12,1,1): this returns null. Please see this file for details: https://github.com/skambha/notes/blob/master/case1_works_notes.txt

There is only one partition and one WholeStageCodegen subtree, so there is no intermediate write via UnsafeRowWriter. All the values are added up into a sum that is an overflow value; the check-overflow code then kicks in, sees that it is an overflow value, and turns it into null. The null then gets written out.

1. There is 1 partition.
2. There is 1 WholeStageCodegen subtree.
3. The updateExpressions codegen does the add (+) for all 12 decimal values and _**does not**_ write the intermediate result out to UnsafeRow via UnsafeRowWriter. The sum amounts to `120000000000000000000.000000000000000000`, which is an overflowed value for decimal(38,18).
4. The sum then goes to the CheckOverflow code, which checks the decimal value against the target precision; it overflows, so this call evaluates to null.
```
/* 239 */ agg_value_1 = agg_mutableStateArray_0[0].toPrecision(
/* 240 */   38, 18, Decimal.ROUND_HALF_UP(), true);
```
5. The null is then written out via the UnsafeRowWriter.write call below.
```
/* 250 */ if (agg_isNull_1) {
/* 251 */   range_mutableStateArray_0[3].write(0, (Decimal) null, 38, 18);
```

This explains why this scenario works fine.

**Case 2**, which used spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)) and got wrong results. Please see this file for details: https://github.com/skambha/notes/blob/master/case2_wrongresults.txt

1. WholeStageCodegen has 4 subtrees.
2. The number of partitions is 2.
3. There are 11 rows in one partition and 1 row in the other.
4. For the partition with 11 rows, the values add up to `110000000000000000000.000000000000000000`, which does not fit in decimal(38,18), so UnsafeRowWriter writes it out as `null`.
5. For the partition with 1 row, the sum is `10000000000000000000.000000000000000000`, which does fit.
6. Next, the merge combines the two partial sums, a) null and b) `10000000000000000000`, using the coalesce and + expressions in mergeExpressions, and only then is that value checked for overflow. The merged value fits, so the query returns the wrong result. (Sidenote: if you split the range into 2 and 11 elements, you can see the result is `20000000000000000000.000000000000000000`.)

---

We already saw that when an overflowed decimal value is written out to UnsafeRow, it is written as a null value. So whether the wrong results reproduce or not comes down to:
- the combination of values in each partition: if the sum of the values in a partition becomes an overflow value **and** that sum ends up being written out via UnsafeRowWriter, then that partition's sum becomes **null**, **and**
- whether, in the subsequent merge phase, the sum of the per-partition values is an overflow value or not:
  - If it is an overflow value, then due to the overflow check (coming from evaluateExpression), the resultant value becomes null and is then written out.
  - If it is not an overflow value, the resultant value (which could be wrong) gets written out.
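
For reference, here is a minimal sketch of the two scenarios discussed above. This is not the exact repro from the linked comment; it assumes each row carries the value `10000000000000000000` cast to decimal(38,18) (consistent with the per-partition sums described above) and that it runs against a build without the fix in this PR.

```scala
import org.apache.spark.sql.functions._

// Hypothetical per-row value: 10000000000000000000 as decimal(38,18).
// 11 or 12 of these sum past what decimal(38,18) can hold; 1 or 2 of them fit.
val d = "cast('10000000000000000000' as decimal(38,18)) as d"

// Case 1: one partition, one WholeStageCodegen subtree. The 12 values are added
// without an intermediate UnsafeRow write; CheckOverflow sees the overflowed sum
// and the query returns null.
spark.range(0, 12, 1, 1).selectExpr(d).agg(sum("d")).show(false)

// Case 2: two partitions (1 row + 11 rows). The 11-row partition's partial sum
// overflows and is written out as null by UnsafeRowWriter; the merge phase
// coalesces the null away, the merged value fits in decimal(38,18), and a wrong
// (non-null) result is returned.
spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d")).show(false)

// Sidenote variant: with a 2-row + 11-row split, only the 11-row partial sum
// overflows to null, so the result is 20000000000000000000.000000000000000000.
spark.range(0, 2, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d")).show(false)
```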
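And a small sketch of how the plan details cited above (partition counts, WholeStageCodegen subtrees, and the generated code the snippets were taken from) can be inspected. I believe `debugCodegen()` from org.apache.spark.sql.execution.debug is the usual way to dump the generated code; `d` is the same hypothetical projection as in the previous sketch.

```scala
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions._

val d = "cast('10000000000000000000' as decimal(38,18)) as d"

val case1 = spark.range(0, 12, 1, 1).selectExpr(d).agg(sum("d"))
val case2 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
  .selectExpr(d).agg(sum("d"))

// Input partition counts: 1 for case 1, 2 for case 2.
println(spark.range(0, 12, 1, 1).rdd.getNumPartitions)
println(spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)).rdd.getNumPartitions)

// Dumps the WholeStageCodegen subtrees and their generated Java code, which is
// where the toPrecision / UnsafeRowWriter.write lines quoted above come from.
case1.debugCodegen()
case2.debugCodegen()
```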
