cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-595067541

Found a way to reproduce without a join:

```
scala> val decimalStr = "1" + "0" * 19
decimalStr: String = 10000000000000000000

scala> val df = spark.range(0, 12, 1, 1)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show // This is correct
+------+
|sum(d)|
+------+
|  null|
+------+

scala> val df = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show // This is wrong
+--------------------+
|              sum(d)|
+--------------------+
|10000000000000000...|
+--------------------+
```

I think the root cause is: `sum` in the partial aggregate overflows and writes null to the unsafe row. `sum` in the final aggregate then reads null from the unsafe row, mistakenly thinks the null was caused by empty data, and converts it to 0. We should create a `DecimalSum`, which uses 2 buffer attributes: `sum` and `isEmpty`. Then in the final aggregate we can check the `isEmpty` flag to know whether the null is caused by overflow or by empty data.
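To illustrate the idea, here is a minimal sketch of the proposed two-attribute buffer in plain Scala (not Spark's actual `Sum` implementation; `SumBuffer`, `DecimalSumSketch`, and the precision check are hypothetical names invented for this example). A `None` sum models the overflow-null, while a separate `isEmpty` flag tracks whether any row was seen, so the final aggregate no longer confuses the two:

```scala
import java.math.BigDecimal

// Hypothetical model of the proposed aggregation buffer:
// `sum` is None once an overflow has occurred; `isEmpty` records
// whether any input row has been aggregated into this buffer.
case class SumBuffer(sum: Option[BigDecimal], isEmpty: Boolean)

object DecimalSumSketch {
  // Decimal(38, 18): any result needing more than 38 digits overflows.
  val MaxPrecision = 38

  private def checkOverflow(d: BigDecimal): Option[BigDecimal] =
    if (d.precision > MaxPrecision) None else Some(d)

  // Partial aggregate: add one value; once sum is None it stays None,
  // but isEmpty is cleared because we did see data.
  def update(buf: SumBuffer, value: BigDecimal): SumBuffer =
    SumBuffer(buf.sum.flatMap(s => checkOverflow(s.add(value))), isEmpty = false)

  // Final aggregate: merge two partial buffers. A None from either side
  // propagates (overflow), and isEmpty is true only if both sides saw no data.
  def merge(a: SumBuffer, b: SumBuffer): SumBuffer = {
    val merged = for { x <- a.sum; y <- b.sum; r <- checkOverflow(x.add(y)) } yield r
    SumBuffer(merged, a.isEmpty && b.isEmpty)
  }

  // Both cases still evaluate to null (None), but they are now
  // distinguishable: empty data via the flag, overflow via sum = None.
  def evaluate(buf: SumBuffer): Option[BigDecimal] =
    if (buf.isEmpty) None else buf.sum
}
```

With this buffer, a partial aggregate that overflowed hands the final aggregate `SumBuffer(None, isEmpty = false)`, which can no longer be mistaken for "no rows" and coerced to 0.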