skambha commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
URL: https://github.com/apache/spark/pull/27627#issuecomment-611165856

> > How about we add a new expression DecimalSum?
>
> Analyzer is responsible for converting Sum with decimal type input to DecimalSum.

@cloud-fan, thanks. There are two things I wanted to update.

A) I looked at the idea of a new `DecimalSum`, and implementation-wise it is fine. I will push those changes right after this comment. A few observations:

- When a user calls the sum function on a decimal column, we can add a rule in the analyzer that converts it to `DecimalSum`. Currently there are some optimizer rules that match on `Sum` with decimal-type input; those would also need to be changed to look at `DecimalSum` instead.
- An analyzer or optimizer rule by itself still cannot cover the cases where the conversion happens as part of the physical plan. For example, the `Average` aggregate function uses the `Sum` expression in its implementation, so an analyzer rule is not enough to make the change transparently; we would need to change the `Average` implementation as well.
- For cases similar to `Average`, this looks like a maintenance overhead: every function implementor would need to know to use `DecimalSum` instead of `Sum` and special-case it. A fix inside `Sum` itself is self-contained.

@cloud-fan, so I wanted your input on this: what is the rationale for preferring a `DecimalSum` implementation instead of fixing this in `Sum`?

B) The algorithm mentioned [here](https://github.com/apache/spark/pull/27627#issuecomment-597440834) does not cover the scenario where we may have valid null rows. That implementation is uploaded on my GitHub here, just as an FYI:
https://github.com/skambha/notes/blob/master/DecimalSum2.scala

> initial value is [0, true]

So if we have two null rows, the expected result should be null, not zero. We need a way to distinguish valid null rows from null values that come from an overflow.

I have a solution using a combination of the approaches we have discussed so far. It handles valid null rows and the overflow cases consistently, with whole-stage codegen enabled or disabled, and with ANSI mode enabled or disabled.
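To make point A concrete, here is a minimal, self-contained sketch of the analyzer-rule idea. Everything here (`Expr`, `Column`, `Sum`, `DecimalSum`, `rewriteSumOfDecimal`) is a toy model I made up for illustration, not Spark's actual `Expression` tree or `Rule` API. It only shows the mechanical rewrite; the limitation discussed above still applies, because a `Sum` constructed internally by `Average`'s physical implementation never passes through such a rule.

```scala
// Toy model of an expression tree; NOT Spark's real Catalyst classes.
sealed trait DataType
case object DecimalType extends DataType
case object LongType extends DataType

sealed trait Expr
case class Column(name: String, dataType: DataType) extends Expr
case class Sum(child: Expr, dataType: DataType) extends Expr
case class DecimalSum(child: Expr) extends Expr
case class Average(child: Expr, dataType: DataType) extends Expr

// The hypothetical rule: rewrite Sum over decimal input into DecimalSum,
// recursing so nested expressions are covered too.
def rewriteSumOfDecimal(e: Expr): Expr = e match {
  case Sum(child, DecimalType) => DecimalSum(rewriteSumOfDecimal(child))
  case Sum(child, dt)          => Sum(rewriteSumOfDecimal(child), dt)
  case Average(child, dt)      => Average(rewriteSumOfDecimal(child), dt)
  case other                   => other
}

// A decimal Sum in the (logical) tree is rewritten ...
println(rewriteSumOfDecimal(Sum(Column("amount", DecimalType), DecimalType)))
// ... while a non-decimal Sum is left alone.
println(rewriteSumOfDecimal(Sum(Column("n", LongType), LongType)))
```

The sketch only touches expressions visible to the rule; a `Sum` created inside an operator's physical implementation is exactly the gap described above.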
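To illustrate point B, here is a self-contained sketch of why a `[0, true]` initial value is not enough, and how tracking an `isEmpty` flag alongside the sum distinguishes an all-null input (result null) from a null caused by overflow. The `MAX` bound, `Buffer` shape, and function names are stand-ins for illustration only, not the actual PR code.

```scala
// Hypothetical overflow bound standing in for the decimal type's max value.
val MAX = BigDecimal(1000)

// Aggregation buffer: running sum, whether any non-null value was seen,
// and whether an overflow occurred along the way.
case class Buffer(sum: BigDecimal, isEmpty: Boolean, overflowed: Boolean)

val init = Buffer(BigDecimal(0), isEmpty = true, overflowed = false)

def update(b: Buffer, row: Option[BigDecimal]): Buffer = row match {
  case None => b // valid null row: buffer stays "empty" if nothing seen yet
  case Some(v) =>
    val s = b.sum + v
    if (s.abs > MAX) b.copy(isEmpty = false, overflowed = true)
    else Buffer(s, isEmpty = false, overflowed = b.overflowed)
}

// Final result: null (None) for all-null input, and null on overflow
// (the non-ANSI behavior); ANSI mode would raise an error instead.
def evaluate(b: Buffer): Option[BigDecimal] =
  if (b.isEmpty || b.overflowed) None else Some(b.sum)

// Two null rows => null, not zero:
println(evaluate(Seq[Option[BigDecimal]](None, None).foldLeft(init)(update)))   // None
// Null rows mixed with values still sum correctly:
println(evaluate(Seq(Some(BigDecimal(1)), None, Some(BigDecimal(2)))
  .foldLeft(init)(update)))                                                     // Some(3)
```

Without the `isEmpty` flag, both the all-null case and a genuine zero sum would collapse to the same buffer state, which is exactly the ambiguity described above.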
