cloud-fan edited a comment on pull request #29448: URL: https://github.com/apache/spark/pull/29448#issuecomment-674723823
Actually this is more tricky than I thought. In 3.0/2.4, without the unsafe row bug fix:

1. For hash aggregate with GROUP BY (so that we need a binary hash map), the query fails as soon as the overflow happens, due to the unsafe row bug.
2. For hash aggregate without GROUP BY, or sort aggregate, the sum value is actually stored in a `Decimal` object, which can hold the overflowed value.

(2) is very tricky:

1. If the overflow happens in the final aggregate, the final `CheckOverflow` operator can give us the correct result.
2. If the overflow happens in the partial aggregate, it produces null, and the final aggregate treats null as 0 (which indicates empty input), so we get a wrong result.

The failed test will not pass even if we revert the commit, if we change the input DataFrame partition number to 1 to trigger overflow in the partial aggregate.

To summarize for 3.0/2.4:

1. For hash aggregate with GROUP BY, we always fail on overflow, even under non-ANSI mode. This is not ideal but also not a serious bug.
2. For hash aggregate without GROUP BY, or sort aggregate, Spark returns a wrong result if overflow happens in the partial aggregate, but is fine if overflow happens in the final aggregate.

That said, #29404 introduced breaking changes to (2), as it always fails on overflow. Let's revert it.

For the unsafe row bug fix #29125: it's important because the unsafe row binary is used to check equality in many places (join key, grouping key, window partition key, etc.), but it also makes (1) worse, as Spark may return a wrong result. We can simply revert it as well, or we can re-apply #29404 only for hash aggregate with GROUP BY to avoid breaking changes (which can be complex if we consider distinct aggregates).

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
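The wrong-result scenario in (2) can be illustrated outside Spark. Below is a minimal Python simulation of the semantics described above, not actual Spark code: the function names and the overflow bound are hypothetical stand-ins. A partial aggregate whose sum overflows the decimal precision returns null (`None`), and a final merge that skips null partials as if they were empty input silently drops that partition's contribution.

```python
from decimal import Decimal

# Hypothetical bound standing in for the decimal precision cap; not a Spark constant.
MAX = Decimal(10) ** 38 - 1

def partial_sum(values):
    """Partial aggregate: overflow yields null (None), mimicking non-ANSI sum."""
    total = sum(values, Decimal(0))
    return None if abs(total) > MAX else total

def final_merge(partials):
    """Final aggregate that treats null as empty input -- the buggy behavior."""
    total = Decimal(0)
    for p in partials:
        if p is not None:  # an overflowed (null) partial is silently dropped
            total += p
    return total

# Two partitions; the first overflows in its partial aggregate.
partials = [partial_sum([MAX, MAX]), partial_sum([Decimal(1)])]
print(final_merge(partials))  # prints 1: the overflowed partition vanished
```

With a single partition the overflow would instead surface in the final aggregate, where a `CheckOverflow`-style check can still report it correctly, which matches the asymmetry described above.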
