zoltar9264 commented on PR #22669: URL: https://github.com/apache/flink/pull/22669#issuecomment-1593370772
Sorry for the late reply @rkhachatryan. I spent some time troubleshooting the [failed ci test](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49906&view=logs&j=013809d8-015f-5db0-3ef7-15ff765ce858&t=fbb13c77-eef1-533a-49f6-142d8268831b&l=8824) after this change, in which some materialized files could not be found. I finally found that the following sequence causes the problem when the state changelog is enabled, materialization uses RocksDB incremental snapshots, and the materialization interval is close to the checkpoint interval:

1. materialization-1 is triggered and completes, and 1.sst is uploaded as remote-file-1.
2. checkpoint-1 is triggered and completes. It uses the result of materialization-1, so it references remote-file-1.
3. materialization-2 is triggered and completes, but checkpoint-1 has not been confirmed yet, so materialization-1 is not confirmed either. materialization-2 is therefore not based on materialization-1 and re-uploads 1.sst as remote-file-2.
4. checkpoint-1 is confirmed, and so is materialization-1.
5. checkpoint-2 is triggered and completes. It uses the result of materialization-2, so it references remote-file-2 instead of remote-file-1.
6. checkpoint-1 is subsumed on the JobManager side, and SharedStateRegistry discards remote-file-1 because it is not referenced by the lowestCheckpoint (checkpoint-2).
7. materialization-3 is triggered and completes. It is based on materialization-1 (the latest confirmed materialization at that point), so it references remote-file-1.
8. checkpoint-2 is confirmed, and so is materialization-2.
9. checkpoint-3 is triggered and completes. It uses the result of materialization-3, so it references remote-file-1, but remote-file-1 has already been deleted.

The root cause is that SharedStateRegistry implicitly requires a checkpoint's dependency on a shared file to be continuous: once a file stops being referenced by a newer checkpoint, it must never be referenced again. The current materialization confirmation mechanism breaks this requirement. In the current PR I bypassed the problem by increasing the materialization interval in the CI tests.
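The registry-side part of this sequence (steps 2, 5, 6 and 9) can be replayed against a toy model. This is only a sketch: `ToySharedRegistry`, `register`, and `subsumeBelow` are hypothetical names invented for illustration, not Flink's actual `SharedStateRegistry` API, and the discard rule is reduced to "delete any file whose last referencing checkpoint is older than the lowest retained checkpoint".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a shared-file registry (hypothetical, not Flink's SharedStateRegistry). */
class ToySharedRegistry {
    // remote file name -> ID of the newest checkpoint that referenced it
    private final Map<String, Long> lastUsedBy = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    /** Registers the files of a completed checkpoint; returns false if any file was already deleted. */
    boolean register(long checkpointId, List<String> files) {
        boolean allPresent = true;
        for (String file : files) {
            if (deleted.contains(file)) {
                allPresent = false; // dangling reference
            } else {
                lastUsedBy.merge(file, checkpointId, Math::max);
            }
        }
        return allPresent;
    }

    /** Deletes every file whose newest referencing checkpoint is older than the lowest retained one. */
    void subsumeBelow(long lowestRetainedCheckpointId) {
        Iterator<Map.Entry<String, Long>> it = lastUsedBy.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() < lowestRetainedCheckpointId) {
                deleted.add(entry.getKey());
                it.remove();
            }
        }
    }

    boolean isDeleted(String file) {
        return deleted.contains(file);
    }
}

class DanglingReferenceDemo {
    /** Replays steps 2, 5, 6 and 9; returns true only if checkpoint-3 finds all of its files. */
    static boolean replay() {
        ToySharedRegistry registry = new ToySharedRegistry();
        registry.register(1, List.of("remote-file-1")); // step 2: checkpoint-1 uses materialization-1
        registry.register(2, List.of("remote-file-2")); // step 5: checkpoint-2 uses materialization-2
        registry.subsumeBelow(2);                       // step 6: checkpoint-1 is subsumed
        return registry.register(3, List.of("remote-file-1")); // step 9: checkpoint-3 uses materialization-3
    }

    public static void main(String[] args) {
        System.out.println("checkpoint-3 usable: " + replay()); // prints "checkpoint-3 usable: false"
    }
}
```

In this model the gap at checkpoint-2, which references remote-file-2 only, is what allows remote-file-1 to be deleted before checkpoint-3 asks for it again.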
To actually fix the problem, I suggest the following change: don't trigger a new materialization until the previous one has either been confirmed or failed. WDYT @rkhachatryan ?
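A minimal sketch of such a gate, assuming it is enough to track a single in-flight materialization. `MaterializationGate` and its methods are hypothetical names for illustration and do not correspond to existing Flink classes:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate that allows at most one unconfirmed materialization at a time. */
class MaterializationGate {
    // true while a materialization has been started but is neither confirmed nor failed
    private final AtomicBoolean pending = new AtomicBoolean(false);

    /** Returns true if a new materialization may start now, false if the previous one is still pending. */
    boolean tryStart() {
        return pending.compareAndSet(false, true);
    }

    /** To be called once the pending materialization has been confirmed. */
    void onConfirmed() {
        pending.set(false);
    }

    /** To be called if the pending materialization fails. */
    void onFailed() {
        pending.set(false);
    }
}
```

With a gate like this, the periodic trigger would skip a round whenever `tryStart()` returns false, so materialization-2 in the scenario above could not start before materialization-1 was confirmed.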
