yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495915181
 
 
   Hi @StefanRRichter, I understand what you mean and have replied. I think 
this PR is related to FLINK-11662, but does not depend on it. If FLINK-11662 
is completed before this PR, the CheckpointFailureManager will look cleaner. 
If it is resolved after this PR, the semantics of this PR are unaffected. As 
I see it, the semantics of this PR are: tolerate failures in the situations 
we control, and do not tolerate the currently unhandled exceptions. In 
addition, a job failure caused by FLINK-11662 will cause the entire job to 
restart.
   
   Isn't the point of introducing `CheckpointFailureManager` to handle 
`CheckpointFailureReason` better? In fact, `CheckpointFailureReason` is a 
point of change: the set of reasons may grow, because there may be other 
issues like FLINK-11662 that we have not yet found or tracked, or it may 
shrink in the future. We encapsulate the counting logic in 
`handleCheckpointException` so that it can be changed later.
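
   To illustrate the idea, here is a minimal sketch of that encapsulation. It 
is not the code in this PR: the class `CheckpointFailureSketch`, the 
`FailureReason` values, the `FailJobCallback` interface and the 
`tolerableFailures` parameter are all hypothetical names chosen for the 
example. It assumes that "controlled" reasons are counted against a 
threshold, while any other reason fails the job immediately.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class CheckpointFailureSketch {

    /** Hypothetical subset of failure reasons; not the actual Flink enum. */
    public enum FailureReason {
        CHECKPOINT_DECLINED,
        CHECKPOINT_EXPIRED,
        UNHANDLED_EXCEPTION
    }

    /** Hypothetical callback used to fail the whole job. */
    public interface FailJobCallback {
        void failJob(FailureReason reason);
    }

    public static final class FailureManager {
        // Reasons treated as "controlled": they count against the threshold.
        // Changing the policy later only means editing this set and the
        // method below; callers stay untouched.
        private static final Set<FailureReason> TOLERABLE =
                EnumSet.of(FailureReason.CHECKPOINT_DECLINED,
                           FailureReason.CHECKPOINT_EXPIRED);

        private final int tolerableFailures;
        private final FailJobCallback callback;
        private final AtomicInteger continuousFailures = new AtomicInteger();

        public FailureManager(int tolerableFailures, FailJobCallback callback) {
            this.tolerableFailures = tolerableFailures;
            this.callback = callback;
        }

        /** All counting logic lives here. */
        public void handleCheckpointException(FailureReason reason) {
            if (!TOLERABLE.contains(reason)) {
                // Unhandled situations are not tolerated at all.
                callback.failJob(reason);
                return;
            }
            if (continuousFailures.incrementAndGet() > tolerableFailures) {
                callback.failJob(reason);
            }
        }

        /** A successful checkpoint resets the consecutive-failure count. */
        public void handleCheckpointSuccess() {
            continuousFailures.set(0);
        }
    }
}
```

   With `tolerableFailures = 1`, one declined checkpoint is tolerated, a 
second consecutive controlled failure fails the job, and an 
`UNHANDLED_EXCEPTION` fails it regardless of the counter.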

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
