skambha commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect 
results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-611165856
 
 
   > > How about we add a new expression DecimalSum?
   >
   > Analyzer is responsible for converting Sum with decimal type input to DecimalSum.
   
   @cloud-fan,  Thanks.  There are two things I wanted to update.  
   
   A) I looked at the idea of a new DecimalSum, and implementation-wise it is fine. I will push those changes right after this comment.
   
   A few observations: 
   
   When a user calls the sum function on a decimal column, we can have a rule in the analyzer that converts it to DecimalSum. Currently, there are some optimizer rules that operate on Sum over a decimal type; those would also need to be changed to match DecimalSum instead.
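
   As a toy sketch (these case classes are simplified stand-ins, not Spark's real `Expression`/analyzer API), the rule would pattern-match on Sum over a decimal input and swap in DecimalSum:

   ```scala
   // Toy sketch only: simplified stand-ins, not Spark's real Expression API.
   sealed trait Expr { def dataType: String }
   case class Column(name: String, dataType: String) extends Expr
   case class Sum(child: Expr) extends Expr { def dataType: String = child.dataType }
   case class DecimalSum(child: Expr) extends Expr { def dataType: String = child.dataType }

   // The hypothetical analyzer rule: rewrite Sum over a decimal input into
   // DecimalSum, and leave every other expression untouched.
   def rewrite(e: Expr): Expr = e match {
     case Sum(c) if c.dataType == "decimal" => DecimalSum(c)
     case other => other
   }
   ```

   Any optimizer rule that currently matches Sum over decimals would then need the equivalent change to match DecimalSum.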
   
   So when a user calls the sum function, an analyzer or optimizer rule can make the necessary rewrite to DecimalSum. But that alone still cannot cover the cases where a Sum expression is created as part of the physical plan.
   
   For example, the Avg aggregate function uses the Sum expression as part of its implementation. An analyzer rule alone will not make the change transparently; we would also need to change the Avg implementation.
   
   Also, for cases similar to Avg, this looks like a maintenance overhead, i.e. a function implementor would need to know to use DecimalSum instead of Sum and special-case it. A fix inside Sum, by contrast, is self-contained.
   
   @cloud-fan So I wanted your input on this: what is the rationale for preferring a DecimalSum implementation over fixing this in Sum?
   --
   
   B) The algorithm mentioned [here](https://github.com/apache/spark/pull/27627#issuecomment-597440834) does not cover the scenario where we may have valid null rows. That implementation is uploaded on my GitHub, just as an FYI:
   https://github.com/skambha/notes/blob/master/DecimalSum2.scala
   
   > initial value is [0, true]
   
   So if we have 2 null rows, the expected result should be null, not zero.
   We need a way to distinguish between valid null rows and null values that arise from an overflow.
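
   A minimal sketch of that distinction, using plain Scala `BigDecimal` and an assumed cap `MAX` (both stand-ins for Spark's `Decimal` and the maximum value at the result precision): the buffer tracks `(sum, isEmpty)`, so an all-null group finishes as null via the `isEmpty` flag, while an overflow is encoded separately by the sum itself collapsing to `None`:

   ```scala
   // Sketch only: BigDecimal and MAX stand in for Spark's Decimal and the
   // maximum value representable at the result precision.
   val MAX = BigDecimal(99999)

   // Aggregation buffer: (running sum, or None once overflowed; isEmpty flag).
   type Buf = (Option[BigDecimal], Boolean)
   val init: Buf = (Some(BigDecimal(0)), true)

   def update(state: Buf, row: Option[BigDecimal]): Buf = row match {
     case None => state              // a valid null row leaves the buffer alone
     case Some(v) =>
       val (sum, _) = state
       // once the sum exceeds MAX it collapses to None and stays None
       (sum.map(_ + v).filter(_.abs <= MAX), false)
   }

   def finish(state: Buf): Option[BigDecimal] =
     if (state._2) None  // saw only null rows (or no rows): result is null, not 0
     else state._1       // Some(sum), or None if an overflow occurred
   ```

   With this shape, two null rows leave `isEmpty` true and finish as null, instead of finishing as the initial zero.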
   
   --
   I have a solution, using a combination of the approaches we have discussed so far, that handles both valid null rows and the overflow cases consistently, for whole-stage codegen enabled/disabled and for ANSI mode enabled/disabled.
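
   Purely as a hedged sketch of how the final evaluation could behave under those modes (this is not the actual patch, and `evalFinal` is a hypothetical helper): ANSI on raises an error on overflow, ANSI off returns null, and an all-null group returns null in either mode:

   ```scala
   // Hedged sketch, not the actual patch: evalFinal is a hypothetical helper.
   def evalFinal(sum: Option[BigDecimal], isEmpty: Boolean,
                 ansi: Boolean): Option[BigDecimal] =
     if (isEmpty) None                    // only null rows: result is null
     else sum match {
       case None if ansi =>
         throw new ArithmeticException("Decimal overflow")  // ANSI mode: fail
       case other => other                // valid sum, or null on overflow
     }
   ```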
