[jira] [Commented] (FLINK-13905) Separate checkpoint triggering into stages

Biao Liu (Jira) Fri, 01 Nov 2019 07:12:55 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964855#comment-16964855
 ]


Biao Liu commented on FLINK-13905:
----------------------------------

Hi [~pnowojski],
{quote}So the periodic trigger would, if there is an ongoing chain of A->B->C, 
just enqueue a request in this queue, otherwise it would trigger "A". Then 
we also need a manual logic in A, B and C, that if they fail, we re-check the 
queue or if "C" completes successfully, it also rechecks the queue?{quote}
Yes, exactly. The queue is re-checked whenever a trigger finishes, whether it 
succeeded or failed.
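
To make the re-checking concrete, here is a minimal sketch of the queue. It 
is only an illustration under assumptions: the class {{TriggerQueueSketch}} 
and its methods are hypothetical names, not part of 
{{CheckpointCoordinator}}, and a trigger chain is modeled as a supplier of a 
{{CompletableFuture}}.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch of the queue-and-re-check idea; not Flink code. */
final class TriggerQueueSketch {

    // Pending trigger requests (periodic checkpoints or savepoints).
    private final Queue<Supplier<CompletableFuture<Void>>> pending = new ArrayDeque<>();

    // True while one trigger chain (A->B->C) is in flight.
    private boolean triggering;

    /** Enqueues a request and starts it right away if nothing is in flight. */
    synchronized void request(Supplier<CompletableFuture<Void>> trigger) {
        pending.add(trigger);
        recheckQueue();
    }

    /** Starts the next queued request, if any, unless a trigger is ongoing. */
    private synchronized void recheckQueue() {
        if (triggering) {
            return;
        }
        Supplier<CompletableFuture<Void>> next = pending.poll();
        if (next == null) {
            return;
        }
        triggering = true;
        // Re-check on completion, whether the trigger succeeded or failed.
        next.get().whenComplete((ignored, failure) -> onTriggerFinished());
    }

    private synchronized void onTriggerFinished() {
        triggering = false;
        recheckQueue();
    }

    public static void main(String[] args) {
        TriggerQueueSketch queue = new TriggerQueueSketch();
        List<String> started = new ArrayList<>();
        CompletableFuture<Void> checkpoint = new CompletableFuture<>();

        queue.request(() -> { started.add("checkpoint"); return checkpoint; });
        queue.request(() -> { started.add("savepoint"); return new CompletableFuture<Void>(); });
        System.out.println(started); // only the checkpoint has started

        checkpoint.completeExceptionally(new RuntimeException("trigger failed"));
        System.out.println(started); // the failure released the queued savepoint
    }
}
```

The sketch uses a lock only to stay self-contained. If all of this ran in the 
main thread, the {{synchronized}} keywords would not be needed.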

{quote}Isn't it almost the same logic as scheduling the next checkpoint with a 
delay manually from A, B or C? Without the need for FLINK-13848? {quote}
Yes, if checkpoints are triggered manually, we need neither FLINK-13848 nor 
the queue mentioned for periodic triggering. But one thing blocks this 
approach: the savepoint. A savepoint can be triggered at any time, so we have 
to queue the savepoint trigger request somehow if a checkpoint or savepoint 
is ongoing. The queuing and re-checking logic still can't be avoided, so 
manual triggering seems less meaningful. 

{quote}Side note, haven't you implemented something similar or exactly this in 
one of the PRs, in a commit that was ultimately dropped?{quote}
Not yet. There is just a POC, and I postponed the PR. I think it's better to 
have FLINK-14344 first. After all, master hook triggering is a part of 
{{triggerCheckpoint}}. 

{quote}In the end, what do you think would be an easier/cleaner/better approach 
to solve this?{quote}
I wish I had a perfect one... It seems that making the whole workflow 
asynchronous sometimes complicates the implementation.
BTW, I have an idea: encapsulating the checkpoint lifecycle in a finite state 
machine model. When the state transitions from {{TRIGGERING}} to 
{{SNAPSHOTTING}} (normally) or {{FAILED}} (exceptionally), it re-checks the 
queue. This way, there would be only a few entry points for the re-check, and 
the code might be easier to read and maintain. It might alleviate the pain, 
but the approach itself is not actually simpler.
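
A rough sketch of the state machine idea, just to show its shape. The state 
names {{TRIGGERING}}, {{SNAPSHOTTING}} and {{FAILED}} are the ones above. 
Everything else is an assumption made for illustration: the {{IDLE}} state, 
the class {{TriggerStateMachineSketch}}, and modeling requests as plain 
strings. It tracks only the state of the most recent trigger and assumes 
single-threaded use.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of the state machine idea; not Flink code. */
final class TriggerStateMachineSketch {

    // TRIGGERING, SNAPSHOTTING and FAILED come from the comment; IDLE is an assumption.
    enum State { IDLE, TRIGGERING, SNAPSHOTTING, FAILED }

    private State state = State.IDLE;
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> started = new ArrayList<>();

    /** Queues a request; starts it immediately unless a trigger is in progress. */
    void request(String name) {
        pending.add(name);
        if (state != State.TRIGGERING) {
            startNext();
        }
    }

    /** Leaves TRIGGERING and re-checks the queue, on success and on failure alike. */
    void transition(State next) {
        boolean legal = state == State.TRIGGERING
                && (next == State.SNAPSHOTTING || next == State.FAILED);
        if (!legal) {
            throw new IllegalStateException(state + " -> " + next);
        }
        state = next;
        startNext();
    }

    /** Moves back to TRIGGERING if a request is waiting; otherwise does nothing. */
    private void startNext() {
        String next = pending.poll();
        if (next != null) {
            state = State.TRIGGERING;
            started.add(next);
        }
    }

    State state() {
        return state;
    }

    List<String> started() {
        return started;
    }

    public static void main(String[] args) {
        TriggerStateMachineSketch fsm = new TriggerStateMachineSketch();
        fsm.request("cp-1");
        fsm.request("sp-1");            // queued: cp-1 is still TRIGGERING
        fsm.transition(State.FAILED);   // the failure path re-checks the queue
        System.out.println(fsm.started() + " " + fsm.state());
    }
}
```

Since every exit from {{TRIGGERING}} goes through {{transition}}, the 
re-check after a trigger lives in one place instead of in each of A, B and C.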

Do you have any better idea?


> Separate checkpoint triggering into stages
> ------------------------------------------
>
>                 Key: FLINK-13905
>                 URL: https://issues.apache.org/jira/browse/FLINK-13905
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Biao Liu
>            Assignee: Biao Liu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently {{CheckpointCoordinator#triggerCheckpoint}} includes some heavy IO 
> operations. We plan to separate the triggering into different stages. The IO 
> operations are executed in IO threads, while other on-memory operations are 
> not.
> This is a preparation for making all on-memory operations of 
> {{CheckpointCoordinator}} single threaded (in main thread).
> Note that we could not put on-memory operations of triggering into main 
> thread directly now. Because there are still some operations on a heavy lock 
> (coordinator-wide).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
