skambha commented on issue #27629: [SPARK-28067][SQL] Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-589479357
 
 
   > This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs treat this properly, so temporary overflows don't cause exceptions.
   > 
   
   I see what you are saying, but this PR targets only the aggregate sum of the decimal type (where the result type is also decimal), not int or long. Sums of ints are handled the same way as before, so the use case you mention above does not regress [1]; see the sketch below.
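   A minimal spark-shell sketch of that point (the column name `v` and the data are my illustration, not from the PR):
   
   ```scala
   // Minimal sketch, assuming a spark-shell session where `spark` and
   // spark.implicits._ are already in scope. Spark buffers integral sums
   // as a long, so Int.MaxValue + 10 never overflows the buffer and the
   // final result is correct.
   val df = Seq(Int.MaxValue, 10, -100).toDF("v")
   df.selectExpr("sum(v)").show()
   // +----------+
   // |    sum(v)|
   // +----------+
   // |2147483557|
   // +----------+
   ```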
   
   This PR is trying to handle the following use case for the aggregate sum of decimals (sketched after this list):
   - Sum of the decimal type overflows and returns wrong results.
   - Note: in the current code (even without this PR), the same sum over decimals throws an exception when whole-stage codegen is disabled.
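   A hedged sketch of the symptom (the values are my illustration; the JIRA has the authoritative reproduction). decimal(38,18) leaves 20 integral digits, so twelve values of 10^19 need 21 digits and overflow the sum's own result type:
   
   ```scala
   // Illustrative only: on the affected versions, the whole-stage-codegen
   // path returned a wrong value for this query, while the same query with
   // codegen disabled threw an exception.
   val big = "1" + "0" * 19  // 10^19, i.e. 20 integral digits
   spark.range(12)
     .selectExpr(s"cast('$big' as decimal(38,18)) as d")
     .selectExpr("sum(d)")  // result type decimal(38,18); 12 * 10^19 overflows it
     .show(false)
   ```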
   
   (Furthermore, even if spark.sql.ansi.enabled is set to true, we do not return null. That conf property is meant to ensure that any overflow returns null.)
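   For reference, the conf can be toggled like this (spark.sql.ansi.enabled is an existing Spark 3.0 property; the comment reflects this report's observation):
   
   ```scala
   // spark.sql.ansi.enabled controls overflow handling for arithmetic,
   // but per this report the codegen decimal-sum path returns a wrong
   // value regardless of the setting.
   spark.conf.set("spark.sql.ansi.enabled", "true")
   ```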
   
   Here, we are dealing with a correctness issue. This PR's approach is to match the behavior of the codepath with whole-stage codegen disabled, i.e., throw the exception instead of returning a wrong result.
   
   Actually, this issue is mentioned in SPARK-23179 [3] as a special case, and SPARK-28224 partially addressed it.
   
   FWIW, I checked this on MS SQL Server, and it throws an error as well [2].
   
   > The proper fix for this would be to use a larger data type than the returned one as the buffer. I remember I had a PR for that (#25347). You can check the comments and history of it.
   
   Sure. I checked #25347; it deals with widening the datatype for the aggregate sum of longs to decimal to avoid temporary overflow. The decision was not to make that change because a) it is not a correctness issue, b) there is a performance hit, and c) a workaround exists: if the user sees an exception because of a temporary overflow, they can cast the column to decimal [4]. A sketch of that workaround follows.
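   (A hypothetical example of workaround c), not from the thread; the data is mine:)
   
   ```scala
   // Widen the input to decimal so the sum buffer can absorb a temporary
   // overflow of longs: Long.MaxValue + 10 fits easily in decimal(38,0).
   val longs = Seq(Long.MaxValue, 10L, -100L).toDF("v")
   longs.selectExpr("sum(cast(v as decimal(38,0)))").show(false)
   // expected: 9223372036854775717
   ```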
   
   [1] [SPARK-26218](https://issues.apache.org/jira/browse/SPARK-26218) Overflow on arithmetic operations returns incorrect result
   [2] http://sqlfiddle.com/#!18/e7ecc/1
   [3] [SPARK-23179](https://issues.apache.org/jira/browse/SPARK-23179) Support option to throw exception if overflow occurs during Decimal arithmetic
   [4] https://github.com/apache/spark/pull/25347#issuecomment-525763443
   
   Thanks for your comments. 
