cloud-fan edited a comment on pull request #29448:
URL: https://github.com/apache/spark/pull/29448#issuecomment-674723823


   Actually, this is trickier than I thought.
   
   In 3.0/2.4 without the unsafe row bug fix:
   1. for hash aggregate with GROUP BY (which needs a binary hash map), the query fails as soon as the overflow happens, due to the unsafe row bug.
   2. for hash aggregate without GROUP BY, or for sort aggregate, the sum value is actually stored in a `Decimal` object, which can hold the overflowed value (both cases are sketched below).
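
   To make the two cases concrete, here is a minimal, hypothetical sketch (the data, column names and session setup are made up for illustration, not taken from the failed test):

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   val spark = SparkSession.builder().master("local[4]").getOrCreate()
   import spark.implicits._

   // 38-digit values near the max of decimal(38, 0), so summing a handful of them
   // overflows the sum result type (also decimal(38, 0)).
   val df = Seq.fill(10)("9" * 38).toDF("s")
     .select($"s".cast("decimal(38,0)").as("d"))

   // case 1: hash aggregate with GROUP BY, the sum buffer sits in the unsafe-row
   // based binary hash map.
   df.groupBy(lit(1)).agg(sum($"d")).show()

   // case 2: hash aggregate without GROUP BY (or sort aggregate), the sum buffer
   // is a `Decimal` object that can temporarily hold the overflowed value.
   df.agg(sum($"d")).show()
   ```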
   
   (2) is very tricky:
   1. If the overflow happens in the final aggregate, the final `CheckOverflow` 
operator can give us the correct result.
   2. If the overflow happens in the partial aggregate, it produces null, and the final aggregate treats that null as 0 (which indicates empty inputs), so we end up with a wrong result.
   
   The failed test will still fail even if we revert the commit, if we change the input DataFrame's partition number to 1 so that the overflow is triggered in the partial aggregate.
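
   A hedged sketch of that partial-vs-final distinction, reusing the hypothetical `df` from the sketch above (the partition counts are illustrative):

   ```scala
   // many partitions: each partial sum stays in range, so the overflow only happens
   // in the final aggregate, where the final `CheckOverflow` can still handle it.
   df.repartition(10).agg(sum($"d")).show()

   // one partition: the partial aggregate itself overflows and emits null; the final
   // aggregate then treats that null as 0 (empty input) and returns a wrong result.
   df.repartition(1).agg(sum($"d")).show()
   ```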
   
   To give a summary for 3.0/2.4:
   1. for hash aggregate with GROUP BY, Spark always fails on overflow, even under non-ANSI mode. This is not ideal, but also not a serious bug.
   2. for hash aggregate without GROUP BY, or for sort aggregate, Spark returns a wrong result if the overflow happens in the partial aggregate, but is fine if the overflow happens in the final aggregate.
   
   That said, #29404 introduced a breaking change to (2), as it makes the query always fail on overflow. Let's revert it.
   
   The unsafe row bug fix #29125 is important, as the unsafe row binary is used to check equality in many places (join key, grouping key, window partition key, etc.), but it also makes (1) worse, as Spark may then return a wrong result instead of failing. We can simply revert it as well, or we can re-apply #29404 only for hash aggregate with GROUP BY to avoid the breaking change (which can be complex if we consider distinct aggregates).
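
   Just to illustrate the "equality via unsafe row binary" point with a made-up example (hypothetical data, nothing from this PR): grouping and joining on a decimal key can rely on the key's unsafe row bytes matching, so the same logical value must always serialize to the same bytes.

   ```scala
   // hypothetical decimal keys; the same logical value must have identical
   // unsafe row bytes for the group/join below to behave correctly.
   val left  = Seq(BigDecimal("1.00"), BigDecimal("2.00")).toDF("k")
   val right = Seq(BigDecimal("1.00")).toDF("k")

   left.join(right, "k").show()       // join key equality
   left.groupBy($"k").count().show()  // grouping key equality
   ```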

