cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-595067541
 
 
   Found a way to reproduce this without a join:
   ```
   scala> val decimalStr = "1" + "0" * 19
   decimalStr: String = 10000000000000000000
   
   scala> val df = spark.range(0, 12, 1, 1)
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.select(expr(s"cast('$decimalStr' as decimal(38, 18)) as d")).agg(sum($"d")).show
   // This is correct
   +------+
   |sum(d)|
   +------+
   |  null|
   +------+
   
   scala> val df = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.select(expr(s"cast('$decimalStr' as decimal(38, 18)) as d")).agg(sum($"d")).show
   // This is wrong
   +--------------------+
   |              sum(d)|
   +--------------------+
   |10000000000000000...|
   +--------------------+
   ```
   
   I think the root cause is: `sum` in the partial aggregate overflows and writes null to the unsafe row. `sum` in the final aggregate then reads null from the unsafe row, mistakenly thinks it's caused by empty data, and converts it to 0.
   
   We should create a `DecimalSum` which uses two buffer attributes: `sum` and `isEmpty`. Then in the final aggregate we can check the `isEmpty` flag to know whether the null is caused by overflow or by empty data. A rough sketch of that idea follows.
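
   In the same simplified model (the `DecimalSumBuffer` name, the helpers, and the overflow bound are hypothetical, not the actual implementation):
   ```
   // Hypothetical sketch of the proposed two-attribute buffer: `isEmpty`
   // records whether any row was seen, so a null `sum` can only mean overflow.
   case class DecimalSumBuffer(sum: Option[BigDecimal], isEmpty: Boolean)

   object DecimalSumModel {
     val Overflow = BigDecimal(10).pow(20)   // stand-in for the decimal(38, 18) bound

     val zero = DecimalSumBuffer(Some(BigDecimal(0)), isEmpty = true)

     def update(buf: DecimalSumBuffer, v: BigDecimal): DecimalSumBuffer =
       DecimalSumBuffer(buf.sum.map(_ + v).filter(_ < Overflow), isEmpty = false)

     def merge(a: DecimalSumBuffer, b: DecimalSumBuffer): DecimalSumBuffer = {
       // A None sum on a non-empty side means overflow and must stay None.
       val merged = a.sum.flatMap(x => b.sum.map(x + _)).filter(_ < Overflow)
       DecimalSumBuffer(merged, a.isEmpty && b.isEmpty)
     }

     // Final result: null for empty input (SQL semantics) and null for overflow,
     // but the two causes are no longer collapsed into a wrong finite value.
     def eval(buf: DecimalSumBuffer): Option[BigDecimal] =
       if (buf.isEmpty) None else buf.sum
   }
   ```
   With this scheme, the union example above would evaluate to null in the final aggregate instead of returning 10^19.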
