Hi all,
I will start implementing the changes described in the design document. Any
feedback is welcome throughout the process.
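
As a rough starting point, the counting idea behind FLINK-4810 (fail the job
after "n" unsuccessful checkpoints) could be sketched as below. This is only
an illustration under stated assumptions: the class name, the method names,
and the callback interface are hypothetical and are not taken from the design
document or from existing Flink code. Resetting the counter on a successful
checkpoint is also an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: counts consecutive checkpoint failures and asks the
 * caller to fail the job once a tolerable threshold is exceeded.
 */
class CheckpointFailureManagerSketch {

    /** Hypothetical callback through which the job is failed. */
    interface FailJobCallback {
        void failJob(Throwable cause);
    }

    private final int tolerableFailures;
    private final FailJobCallback callback;
    private final AtomicInteger continuousFailures = new AtomicInteger(0);

    CheckpointFailureManagerSketch(int tolerableFailures, FailJobCallback callback) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
        this.callback = callback;
    }

    /** Assumption: a successful checkpoint resets the failure counter. */
    void handleCheckpointSuccess(long checkpointId) {
        continuousFailures.set(0);
    }

    /** Counts the failure and fails the job once the threshold is exceeded. */
    void handleCheckpointFailure(long checkpointId, Throwable cause) {
        if (continuousFailures.incrementAndGet() > tolerableFailures) {
            callback.failJob(new RuntimeException(
                "Checkpoint " + checkpointId + " exceeded the tolerable failure threshold",
                cause));
        }
    }

    int getContinuousFailures() {
        return continuousFailures.get();
    }
}
```

With a threshold of 2, two consecutive failures are tolerated and the third
one triggers the callback; a success in between starts the count over.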
Best,
Vino
vino yang wrote on Wed, Jan 9, 2019 at 12:29 AM:
> Hi all,
>
>
> Currently, the checkpoint failure handling logic is somewhat confusing
> (it is not handled in one place), which makes some functionality awkward
> to build on the existing code.
>
> So I have written a design document that describes how to improve the
> checkpoint failure handling logic and make it clearer.
>
> Based on this, we introduce a CheckpointFailureManager, which makes the
> checkpoint failure processing more flexible.
>
> This is mainly motivated by the following issues:
>
>
>    - FLINK-4810 [1]: Checkpoint Coordinator should fail ExecutionGraph after
>      "n" unsuccessful checkpoints
>    - FLINK-10074 [3]: Allowable number of checkpoint failure
>    - FLINK-10724 [2]: Refactor failure handling in checkpoint coordinator
>
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing
>
> *Thanks to @Andrey Zagrebin for helping me review the documentation and
> suggesting a lot of improvements.*
>
> Feedback and comments are very welcome!
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/FLINK-4810
> [2]: https://issues.apache.org/jira/browse/FLINK-10724
> [3]: https://issues.apache.org/jira/browse/FLINK-10074
>