zoltar9264 commented on PR #22669:
URL: https://github.com/apache/flink/pull/22669#issuecomment-1593370772

   Sorry for the late reply @rkhachatryan. I spent some time troubleshooting 
the 
[failed CI test](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49906&view=logs&j=013809d8-015f-5db0-3ef7-15ff765ce858&t=fbb13c77-eef1-533a-49f6-142d8268831b&l=8824)
 after this change, in which some materialized files could not be found. 
I eventually found that the problem occurs when the state changelog is enabled, 
materialization uses RocksDB incremental snapshots, and the materialization 
interval is close to the checkpoint interval. The following sequence triggers it:
   
   1. materialization-1 is triggered and completes, and 1.sst is uploaded as 
remote-file-1.
   2. checkpoint-1 is triggered and completes; it uses the result of 
materialization-1, so it references remote-file-1.
   3. materialization-2 is triggered and completes. Because checkpoint-1 has 
not been confirmed yet, materialization-1 is not confirmed either, so 
materialization-2 is not based on materialization-1 and re-uploads 1.sst as 
remote-file-2.
   4. checkpoint-1 is confirmed, so materialization-1 is confirmed.
   5. checkpoint-2 is triggered and completes; it uses the result of 
materialization-2, so it references remote-file-2 instead of remote-file-1.
   6. checkpoint-1 is subsumed on the JobManager side, and SharedStateRegistry 
discards remote-file-1 because it is not referenced by lowestCheckpoint 
(checkpoint-2).
   7. materialization-3 is triggered and completes; it is based on 
materialization-1 (referencing remote-file-1).
   8. checkpoint-2 is confirmed, so materialization-2 is confirmed.
   9. checkpoint-3 is triggered and completes; it uses the result of 
materialization-3, so it references remote-file-1, which has already been 
deleted.
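
The nine steps can be replayed with a small toy model. This is only a sketch in Python, not Flink code: `ToyJob` and its methods are hypothetical names, and it assumes that an incremental materialization reuses files only from the last confirmed materialization and that subsuming a checkpoint discards every registered file no retained checkpoint references.

```python
# Toy model only: ToyJob and its methods are hypothetical, not Flink APIs.
class ToyJob:
    def __init__(self):
        self.remote_files = set()    # files currently present in remote storage
        self.registered = set()      # files known to the (toy) shared state registry
        self.uploads = 0             # counter used to name uploaded files
        self.materializations = {}   # materialization id -> {sst name: remote file}
        self.latest_mat = None       # newest completed materialization
        self.confirmed_mat = None    # newest confirmed materialization
        self.checkpoints = {}        # checkpoint id -> (materialization id, files)
        self.retained = []           # checkpoint ids still retained

    def materialize(self, mat_id, local_ssts):
        """Incremental upload based on the last *confirmed* materialization."""
        base = self.materializations.get(self.confirmed_mat, {})
        result = {}
        for sst in local_ssts:
            if sst in base:
                # Reused without checking whether the remote file still exists.
                result[sst] = base[sst]
            else:
                self.uploads += 1
                remote = f"remote-file-{self.uploads}"
                self.remote_files.add(remote)
                result[sst] = remote
        self.materializations[mat_id] = result
        self.latest_mat = mat_id

    def checkpoint(self, cp_id):
        """Complete a checkpoint on the latest materialization; return missing files."""
        refs = set(self.materializations[self.latest_mat].values())
        self.checkpoints[cp_id] = (self.latest_mat, refs)
        self.retained.append(cp_id)
        self.registered |= refs
        return refs - self.remote_files

    def confirm(self, cp_id):
        """Confirming a checkpoint also confirms the materialization it used."""
        self.confirmed_mat = self.checkpoints[cp_id][0]

    def subsume(self, cp_id):
        """Drop a checkpoint and discard registered files nobody retained references."""
        self.retained.remove(cp_id)
        live = set()
        for retained_id in self.retained:
            live |= self.checkpoints[retained_id][1]
        unused = self.registered - live
        self.remote_files -= unused
        self.registered -= unused


def replay():
    job = ToyJob()
    job.materialize(1, ["1.sst"])   # 1. uploads remote-file-1
    job.checkpoint(1)               # 2. references remote-file-1
    job.materialize(2, ["1.sst"])   # 3. nothing confirmed yet: re-upload as remote-file-2
    job.confirm(1)                  # 4. materialization-1 confirmed
    job.checkpoint(2)               # 5. references remote-file-2
    job.subsume(1)                  # 6. remote-file-1 is discarded
    job.materialize(3, ["1.sst"])   # 7. based on materialization-1: reuses remote-file-1
    job.confirm(2)                  # 8. materialization-2 confirmed
    return job.checkpoint(3)        # 9. files referenced but no longer in storage


print(replay())  # {'remote-file-1'}
```

In this model checkpoint-3 ends up referencing a file that was already discarded in step 6, which matches the missing-file failure seen in CI.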
   
   The root cause is that SharedStateRegistry implicitly requires a 
checkpoint's dependency on a shared file to be continuous, but the current 
materialization confirmation mechanism breaks this assumption. In the current 
PR I worked around the problem by increasing the materialization interval in 
the CI tests. However, I suggest the following change to actually fix it: don't 
trigger a new materialization until the previous one has either been confirmed 
or failed. WDYT @rkhachatryan?
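
To make the suggested rule concrete, here is a rough sketch of such a gate as a toy Python class. `MaterializationGate`, `try_trigger`, `on_confirmed` and `on_failed` are hypothetical names for illustration, not existing Flink classes or methods, and the real change would have to live in the changelog materialization code.

```python
# Hypothetical helper, not part of Flink: allows one unresolved materialization at a time.
class MaterializationGate:
    def __init__(self):
        self.next_id = 1
        self.pending = None  # id of the materialization awaiting confirmation or failure

    def try_trigger(self):
        """Return a new materialization id, or None if the previous one is unresolved."""
        if self.pending is not None:
            return None
        self.pending = self.next_id
        self.next_id += 1
        return self.pending

    def on_confirmed(self, mat_id):
        """Release the gate once the pending materialization is confirmed."""
        if self.pending == mat_id:
            self.pending = None

    def on_failed(self, mat_id):
        """Release the gate once the pending materialization has failed."""
        if self.pending == mat_id:
            self.pending = None


gate = MaterializationGate()
first = gate.try_trigger()    # materialization-1 starts
skipped = gate.try_trigger()  # timer fires again before checkpoint-1 is confirmed
gate.on_confirmed(first)      # checkpoint-1 confirmed, so materialization-1 is too
second = gate.try_trigger()   # materialization-2 can now build on materialization-1
print(first, skipped, second)  # 1 None 2
```

With a gate like this, step 3 of the sequence above cannot happen: materialization-2 starts only after materialization-1 is resolved, so it keeps referencing remote-file-1 rather than re-uploading 1.sst.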

