HeartSaVioR opened a new pull request #25965: [SPARK-26425][SS] Add more constraint checks in file streaming source to avoid checkpoint corruption
URL: https://github.com/apache/spark/pull/25965

### What changes were proposed in this pull request?

Credits to @tdas, who reported SPARK-26425 and described the fix. I just followed the description in the issue.

This patch adds more checks to the file streaming source so that multiple concurrent runs of a streaming query don't corrupt the state of the query or its checkpoint.

This patch addresses two spots, each with a slightly different issue:

1. HDFSMetadataLog.getLatest()

   It is odd to allow reading the metadata of a non-latest batch and treating it as the latest. This only happens when the query finds the latest batch file in the listing, but the file is deleted just before it is read. This should be treated as a critical error and reported to end users, because it means the metadata is being modified by another query (or manually), which is unsafe.

2. FileStreamSource.fetchMaxOffset()

   In Structured Streaming, we don't allow multiple streaming queries to run with the same checkpoint, so a query should fail if it cannot write the metadata of a specific batch ID because another writer has already written that batch ID. As described in the JIRA issue, an assertion is already applied to the `offsetLog` for the same reason.

   https://github.com/apache/spark/blob/8167714cab93a5c06c23f92c9077fe8b9677ab28/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L394-L402

### Why are the changes needed?

This prevents inconsistent behavior in a streaming query and makes the query fail instead.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A. The change is simple and obvious, and it is really hard to reproduce the issue artificially.
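
For illustration only, the two checks described above can be sketched against a simplified in-memory log. This is not the code in this patch: `InMemoryMetadataLog`, `addOrFail`, the `afterListing` hook, and the error messages are all hypothetical names made up for the sketch, and the real classes read and write files rather than a map.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a batch metadata log.
// It is NOT Spark's HDFSMetadataLog; all names here are illustrative.
class InMemoryMetadataLog[T] {
  // Batch ID -> metadata, kept sorted by batch ID.
  private val batches = mutable.TreeMap.empty[Long, T]

  // Stores metadata only if the batch ID is unused; returns false otherwise.
  def add(batchId: Long, metadata: T): Boolean = synchronized {
    if (batches.contains(batchId)) false
    else {
      batches.put(batchId, metadata)
      true
    }
  }

  def get(batchId: Long): Option[T] = synchronized { batches.get(batchId) }

  def remove(batchId: Long): Unit = synchronized { batches.remove(batchId); () }

  def listBatchIds(): Vector[Long] = synchronized { batches.keys.toVector }

  // Check 1: list the batches, then read the newest one. If that batch has
  // disappeared between listing and reading, fail instead of silently
  // falling back to an older batch. `afterListing` is a hook that lets a
  // caller simulate a concurrent deletion in that window.
  def getLatest(afterListing: () => Unit = () => ()): Option[(Long, T)] = {
    val ids = listBatchIds()
    afterListing()
    ids.lastOption.map { id =>
      val metadata = get(id).getOrElse(
        throw new IllegalStateException(
          s"Batch $id was listed but could not be read; " +
            "the log may be modified by another writer"))
      (id, metadata)
    }
  }

  // Check 2: a failed write for a batch ID means someone else already wrote
  // that batch, so fail the caller rather than continue.
  def addOrFail(batchId: Long, metadata: T): Unit = {
    if (!add(batchId, metadata)) {
      throw new IllegalStateException(
        s"Batch $batchId already exists; multiple writers detected")
    }
  }
}
```

With this sketch, a second `addOrFail` for an existing batch ID throws and leaves the stored metadata untouched, and `getLatest(() => log.remove(latestId))` throws instead of returning the previous batch.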
