[GitHub] [spark] HeartSaVioR commented on pull request #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

GitBox Mon, 01 Jun 2020 05:14:02 -0700


HeartSaVioR commented on pull request #27627:
URL: https://github.com/apache/spark/pull/27627#issuecomment-636823889



   This PR looks to "selectively" change the aggregation buffer (not entirely) 
- I don't know the possibility/ratio of actual usage on decimal for input on 
sum so it's not easy to say, but we let the streaming queries which contain 
stream-stream outer join "fail" if it starts with previous version of state.
   
   https://spark.apache.org/docs/3.0.0-preview2/ss-migration-guide.html
   
   > Spark 3.0 fixes the correctness issue on Stream-stream outer join, which 
changes the schema of state. (SPARK-26154 for more details) Spark 3.0 will fail 
the query if you start your query from checkpoint constructed from Spark 2.x 
which uses stream-stream outer join. Please discard the checkpoint and replay 
previous inputs to recalculate outputs.
   
   SPARK-26154 was decided to only put to Spark 3.0.0 because of backward 
incompatibility. IMHO this sounds like a similar case - it should be landed on 
Spark 3.0.0, not sure we do want to make breaking change on Spark 2.4.x.
   
   Btw, this was simply possible because we added "versioning" of the state, 
but we have been applying the versioning for the operator rather than 
"specific" aggregate function, hence this hasn't been the case yet.
   
   If we feel beneficial to construct the way to apply versioning for specific 
aggregate function (at least built-in) and someone has the straightforward way 
to deal with this, it would be ideal to do that.
   
   Otherwise, I'd rather say we should make it fail, but shouldn't be with 
"arbitrary" error message. 
[SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) (#24173) will 
make it possible to show the schema incompatibility and why it's considered as 
incompatible.
   
   I'd also propose to introduce state data source on batch query via 
[SPARK-28190](https://issues.apache.org/jira/browse/SPARK-28190) so that end 
users can easily read and even rewrite the state, which eventually makes end 
users possible to migrate from old state format to the new state format. (It's 
available on [separate 
project](https://github.com/HeartSaVioR/spark-state-tools) for Spark 2.4.x, but 
given there were some relevant requests in user mailing list - e.g. pre-load 
initial state, schema evolution - it's easier to work within the Spark project.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR commented on pull request #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

Reply via email to