[ https://issues.apache.org/jira/browse/FLINK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964855#comment-16964855 ]
Biao Liu commented on FLINK-13905:
----------------------------------

Hi [~pnowojski],

{quote}So the periodic trigger would, if there is an ongoing chain of A->B->C, will just enqueue a request in this queue, otherwise it would trigger "A". Then we also need a manual logic in A, B and C, that if they fail, we re-check the queue or if "C" completes successfully, it also rechecks the queue?{quote}
Yes, exactly. The queue is re-checked whenever a trigger finishes, whether it succeeded or failed.

{quote}Isn't it almost the same logic as scheduling the next checkpoint with a delay manually from A, B or C? Without the need for FLINK-13848?{quote}
Yes. If the next checkpoint is scheduled manually, we need neither FLINK-13848 nor the queue mentioned above for periodic triggering. But one thing blocks this approach: the savepoint. A savepoint can be triggered at any time, so we have to queue the savepoint trigger request somehow if a checkpoint or savepoint is ongoing. The queuing and re-checking logic still can't be avoided, so manual triggering seems to gain little.

{quote}Side note, haven't you implemented something similar or exactly this in one of the PRs, in a commit that was ultimately dropped?{quote}
Not yet. There is just a POC, and I postponed the PR. I think it's better to have FLINK-14344 first. After all, master hook triggering is a part of {{triggerCheckpoint}}.

{quote}In the end, what do you think would be an easier/cleaner/better approach to solve this?{quote}
I wish I had a perfect one... It seems that making the whole workflow asynchronous sometimes complicates the implementation.

BTW, I have an idea: encapsulate the checkpoint lifecycle in a finite state machine model. When the state transitions from {{TRIGGERING}} to {{SNAPSHOTTING}} (normally) or {{FAILED}} (exceptionally), it re-checks the queue. This way there will be only a few entry points for the re-checking, and the code might be easier to read and maintain.
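A rough sketch of the state-machine idea, to make the re-checking concrete. This is not the actual {{CheckpointCoordinator}} code: the class name, the {{IDLE}} state, and every method name here are hypothetical, and a single state field stands in for what would really be per-checkpoint state. It only shows how leaving {{TRIGGERING}} can be the one place where the queue is re-checked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical illustration only; none of these names exist in Flink. */
final class TriggerStateMachineSketch {

    /** TRIGGERING, SNAPSHOTTING and FAILED come from the comment; IDLE is assumed. */
    enum State { IDLE, TRIGGERING, SNAPSHOTTING, FAILED }

    private final Queue<String> pendingRequests = new ArrayDeque<>();
    private final List<String> startedTriggers = new ArrayList<>();
    private State state = State.IDLE;

    /** Entry point for both periodic checkpoint and savepoint requests. */
    void request(String name) {
        if (state == State.TRIGGERING) {
            pendingRequests.add(name); // a trigger is in flight, so queue the request
        } else {
            startTrigger(name);
        }
    }

    /** The trigger stage finished normally. */
    void onTriggerSucceeded() {
        leaveTriggering(State.SNAPSHOTTING);
    }

    /** The trigger stage finished exceptionally. */
    void onTriggerFailed() {
        leaveTriggering(State.FAILED);
    }

    /** Single place where the queue is re-checked, on success and on failure. */
    private void leaveTriggering(State next) {
        if (state != State.TRIGGERING) {
            throw new IllegalStateException("No trigger in flight, state is " + state);
        }
        state = next;
        String queued = pendingRequests.poll();
        if (queued != null) {
            startTrigger(queued);
        }
    }

    private void startTrigger(String name) {
        state = State.TRIGGERING;
        startedTriggers.add(name); // stands in for the real (asynchronous) trigger work
    }

    State state() { return state; }

    int pendingCount() { return pendingRequests.size(); }

    List<String> startedTriggers() { return startedTriggers; }

    public static void main(String[] args) {
        TriggerStateMachineSketch fsm = new TriggerStateMachineSketch();
        fsm.request("checkpoint-1"); // starts immediately
        fsm.request("savepoint-1");  // queued behind checkpoint-1
        fsm.onTriggerFailed();       // failure still re-checks the queue
        fsm.onTriggerSucceeded();    // savepoint-1 leaves TRIGGERING
        System.out.println(fsm.startedTriggers() + " " + fsm.state());
    }
}
```

In this sketch, periodic and savepoint requests go through the same {{request}} method, so the queuing logic exists once regardless of who asks for the trigger.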
It might alleviate the pain, but the approach itself is not actually simplified. Do you have any better idea?

> Separate checkpoint triggering into stages
> ------------------------------------------
>
>                 Key: FLINK-13905
>                 URL: https://issues.apache.org/jira/browse/FLINK-13905
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Biao Liu
>            Assignee: Biao Liu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently {{CheckpointCoordinator#triggerCheckpoint}} includes some heavy IO
> operations. We plan to separate the triggering into different stages. The IO
> operations are executed in IO threads, while the other, in-memory operations
> are not.
> This is a preparation for making all in-memory operations of
> {{CheckpointCoordinator}} single-threaded (in the main thread).
> Note that we cannot yet put the in-memory operations of triggering directly
> into the main thread, because some operations still run under a heavy
> (coordinator-wide) lock.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)