gaoyunhaii commented on pull request #16633:
URL: https://github.com/apache/flink/pull/16633#issuecomment-889843328


   Hi @dawidwys , it is mainly due to the following case:
   
   1. Suppose we have a job vertex that have two subtasks and indeed need to 
wait for the final checkpoint. Then the first task called  `operator.finish()` 
and start to wait for the final checkpoint. Then in the following checkpoint, 
it will report `isOperatorsFinished = true`. 
   2. Since the second task has not reached here, it will continue to report 
`isOperatorsFinished=false`. Then the checkpoint would be aborted as expected.
   3. Later the second tasks called `operator.finish()` and start to wait for 
the final checkpoint. Then for the next checkpoint, it will also report 
`isOperatorsFinished = true`. By here it is as expected since we want to avoid 
the incomplete checkpoint by force the two tasks wait for the same final 
checkpoint. 
   4. However, if the completion notification to the first task succeed, but 
the one to the second task fails, then the first task would finish waiting and 
exit. But the second task would continue to wait, what's more, the following 
checkpoints would always failed since the first task has exit. Then the second 
task would wait for the final checkpoint forever. 
   
   Thus we might need to keep the notification reliable if it is send to a 
vertex who has used `UnionListState` and report `isOperatorsFinished = true` in 
this checkpoint~ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to