HeartSaVioR commented on pull request #27627: URL: https://github.com/apache/spark/pull/27627#issuecomment-636823889
This PR looks to "selectively" change the aggregation buffer (not entirely) - I don't know the possibility/ratio of actual usage on decimal for input on sum so it's not easy to say, but we let the streaming queries which contain stream-stream outer join "fail" if it starts with previous version of state. https://spark.apache.org/docs/3.0.0-preview2/ss-migration-guide.html > Spark 3.0 fixes the correctness issue on Stream-stream outer join, which changes the schema of state. (SPARK-26154 for more details) Spark 3.0 will fail the query if you start your query from checkpoint constructed from Spark 2.x which uses stream-stream outer join. Please discard the checkpoint and replay previous inputs to recalculate outputs. SPARK-26154 was decided to only put to Spark 3.0.0 because of backward incompatibility. IMHO this sounds like a similar case - it should be landed on Spark 3.0.0, not sure we do want to make breaking change on Spark 2.4.x. Btw, this was simply possible because we added "versioning" of the state, but we have been applying the versioning for the operator rather than "specific" aggregate function, hence this hasn't been the case yet. If we feel beneficial to construct the way to apply versioning for specific aggregate function (at least built-in) and someone has the straightforward way to deal with this, it would be ideal to do that. Otherwise, I'd rather say we should make it fail, but shouldn't be with "arbitrary" error message. [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) (#24173) will make it possible to show the schema incompatibility and why it's considered as incompatible. I'd also propose to introduce state data source on batch query via [SPARK-28190](https://issues.apache.org/jira/browse/SPARK-28190) so that end users can easily read and even rewrite the state, which eventually makes end users possible to migrate from old state format to the new state format. (It's available on [separate project](https://github.com/HeartSaVioR/spark-state-tools) for Spark 2.4.x, but given there were some relevant requests in user mailing list - e.g. pre-load initial state, schema evolution - it's easier to work within the Spark project.) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
