[ 
https://issues.apache.org/jira/browse/FLINK-35761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078549#comment-18078549
 ] 

Yuepeng Pan commented on FLINK-35761:
-------------------------------------

Hi [~fanrui], could we close this issue now that all of the subtasks are completed?

> FLIP-547: Support checkpoint during recovery
> --------------------------------------------
>
>                 Key: FLINK-35761
>                 URL: https://issues.apache.org/jira/browse/FLINK-35761
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>             Fix For: 2.3.0
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
>  
> Currently, when a job restores from an unaligned checkpoint, a task 
> transitions from ExecutionState.INITIALIZING to ExecutionState.RUNNING only 
> after all input buffers have been processed.
> This makes the restore time very long if processing performance is poor 
> and the unaligned checkpoint snapshotted too many input buffers. From my 
> experience, the restore time can exceed 30 minutes for jobs with high 
> parallelism.
> We would like the job to switch to RUNNING as soon as possible, because a 
> new checkpoint cannot be triggered during INITIALIZING. Once the job has 
> switched to RUNNING, a new unaligned checkpoint can be taken.
> h2. Solution:
> In brief:
>  # The task is switched to RUNNING after all input buffers have been added 
> to RecoveredInputChannel.
>  ** In general, this is quick unless there are not enough network buffers.
>  ** When there are not enough network buffers, the task still needs to wait 
> for some buffers to be released. (Buffers are released after part of the 
> data has been processed.)
>  # RecoveredInputChannel supports snapshotting its network buffers.
>  
> Additional improvement:
>  * RecoveredInputChannel only requests exclusive buffers and does not 
> request floating buffers.
>  * As a result, RecoveredInputChannel may not have enough network buffers if 
> the old job that created this checkpoint used floating buffers.
>  * We could let RecoveredInputChannel request floating buffers in a 
> separate Jira issue if this optimization makes sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
