gaoyunhaii commented on pull request #16633: URL: https://github.com/apache/flink/pull/16633#issuecomment-889843328
Hi @dawidwys, it is mainly due to the following case:

1. Suppose we have a job vertex that has two subtasks and does need to wait for the final checkpoint. The first task calls `operator.finish()` and starts to wait for the final checkpoint. In the following checkpoint, it reports `isOperatorsFinished = true`.
2. Since the second task has not reached this point yet, it continues to report `isOperatorsFinished = false`. The checkpoint is then aborted, as expected.
3. Later the second task calls `operator.finish()` and starts to wait for the final checkpoint. For the next checkpoint, it also reports `isOperatorsFinished = true`. Up to here everything is as expected, since we want to avoid an incomplete checkpoint by forcing the two tasks to wait for the same final checkpoint.
4. However, if the completion notification to the first task succeeds but the one to the second task fails, the first task stops waiting and exits, while the second task continues to wait. What's more, the following checkpoints would always fail since the first task has exited, so the second task would wait for the final checkpoint forever.

Thus we might need to keep the notification reliable if it is sent to a vertex that has used `UnionListState` and reports `isOperatorsFinished = true` in this checkpoint.
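
The four steps above can be replayed in a small, self-contained simulation. This is only a sketch under stated assumptions: `Subtask`, `checkpointCompletes`, and `secondTaskStuck` are hypothetical names and not Flink classes or APIs, and the abort rule is simplified from the description in this comment (a checkpoint is aborted whenever the two subtasks disagree on `isOperatorsFinished` or on whether they have already exited).

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the scenario; none of these types are Flink classes.
class FinalCheckpointSketch {

    static final class Subtask {
        boolean operatorsFinished; // true once operator.finish() has been called
        boolean exited;            // true once the task stopped waiting and exited

        void finishOperators() {
            operatorsFinished = true;
        }

        // Models a "checkpoint complete" notification that was actually delivered.
        void notifyCheckpointComplete() {
            if (operatorsFinished) {
                exited = true;
            }
        }
    }

    // Assumed rule: a checkpoint is aborted when the subtasks disagree,
    // either on isOperatorsFinished or on whether they have already exited.
    static boolean checkpointCompletes(List<Subtask> tasks) {
        Subtask first = tasks.get(0);
        for (Subtask t : tasks) {
            if (t.operatorsFinished != first.operatorsFinished || t.exited != first.exited) {
                return false;
            }
        }
        return true;
    }

    // Replays steps 1 to 4 and reports whether the second task is still waiting.
    static boolean secondTaskStuck(boolean secondNotificationDelivered) {
        Subtask a = new Subtask();
        Subtask b = new Subtask();
        List<Subtask> tasks = Arrays.asList(a, b);

        // Steps 1 and 2: only the first task has finished, so the checkpoint is aborted.
        a.finishOperators();
        if (checkpointCompletes(tasks)) {
            throw new IllegalStateException("mixed flags should abort the checkpoint");
        }

        // Step 3: both tasks have finished, so the next checkpoint completes.
        b.finishOperators();
        if (!checkpointCompletes(tasks)) {
            throw new IllegalStateException("matching flags should complete the checkpoint");
        }

        // Step 4: the notification reaches the first task, but possibly not the second.
        a.notifyCheckpointComplete();
        if (secondNotificationDelivered) {
            b.notifyCheckpointComplete();
        }

        // Later checkpoint attempts; notifications in this loop are delivered reliably.
        for (int i = 0; i < 5; i++) {
            if (checkpointCompletes(tasks)) {
                for (Subtask t : tasks) {
                    t.notifyCheckpointComplete();
                }
            }
        }
        return !b.exited;
    }

    public static void main(String[] args) {
        System.out.println("lost notification, second task stuck: " + secondTaskStuck(false));
        System.out.println("reliable notification, second task stuck: " + secondTaskStuck(true));
    }
}
```

In this model the second task can only leave the waiting state through a completed checkpoint, and no checkpoint can complete once the first task has exited alone, which is why the single lost notification is enough to block it indefinitely.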
