[
https://issues.apache.org/jira/browse/FLINK-35761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078549#comment-18078549
]
Yuepeng Pan commented on FLINK-35761:
-------------------------------------
Hi [~fanrui], could we close this issue now that all of its subtasks are completed?
> FLIP-547: Support checkpoint during recovery
> --------------------------------------------
>
> Key: FLINK-35761
> URL: https://issues.apache.org/jira/browse/FLINK-35761
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.20.0, 1.19.1
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Fix For: 2.3.0
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
>
> Currently, when a job restores from an unaligned checkpoint, a task
> transitions from ExecutionState.INITIALIZING to ExecutionState.RUNNING only
> after all recovered input buffers have been processed.
> As a result, the restore takes a very long time if processing is slow and
> the unaligned checkpoint snapshotted many input buffers. In my
> experience, the restore time can exceed 30 minutes for a job with high
> parallelism.
> We want the job to switch to RUNNING as soon as possible, because a new
> checkpoint cannot be triggered while tasks are INITIALIZING. Once the job
> has switched to RUNNING, a new unaligned checkpoint can be taken.
> h2. Solution:
> In brief:
> # The task is switched to RUNNING after all input buffers have been added to
> RecoveredInputChannel.
> ** In general, this is quick unless there are not enough network buffers.
> ** When there are not enough network buffers, the task still needs to wait
> until some buffers are released. (A buffer is released after part of the
> data has been processed.)
> # RecoveredInputChannel supports snapshotting its network buffers.
>
> Additional improvement:
> * RecoveredInputChannel requests only the ExclusiveBuffers and does not
> request floating buffers.
> * As a result, RecoveredInputChannel may not have enough network buffers if
> the old job that created this checkpoint used floating buffers.
> * We could let RecoveredInputChannel request floating buffers in a
> separate Jira issue if this optimization makes sense.
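
For readers skimming the thread, here is a minimal sketch of the two solution steps quoted above, written as a toy model rather than Flink code. Every name in it (RecoverySketch, ToyRecoveredChannel, ToyTask, tryAdd, processOne, restore, triggerCheckpoint) is hypothetical and does not correspond to Flink's actual classes or method signatures. Buffers are modelled as strings, and the channel capacity is an assumed stand-in for the available network buffers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical toy model of the proposal; not Flink's real classes. */
class RecoverySketch {
    enum State { INITIALIZING, RUNNING }

    /** Toy stand-in for a recovered input channel with a fixed number of buffers. */
    static final class ToyRecoveredChannel {
        private final Deque<String> queued = new ArrayDeque<>();
        private final int capacity; // assumed number of available network buffers

        ToyRecoveredChannel(int capacity) { this.capacity = capacity; }

        /** Adds a restored buffer, or returns false if no network buffer is free. */
        boolean tryAdd(String buffer) {
            if (queued.size() >= capacity) return false;
            queued.addLast(buffer);
            return true;
        }

        /** Processes the oldest queued buffer, which releases it. */
        String processOne() { return queued.pollFirst(); }

        /** Step 2: the recovered but still unprocessed buffers can be snapshotted. */
        List<String> snapshot() { return new ArrayList<>(queued); }
    }

    static final class ToyTask {
        State state = State.INITIALIZING;
        private final ToyRecoveredChannel channel;
        private final Deque<String> pending; // buffers read back from the old checkpoint

        ToyTask(List<String> restoredBuffers, int capacity) {
            this.pending = new ArrayDeque<>(restoredBuffers);
            this.channel = new ToyRecoveredChannel(capacity);
        }

        /** Step 1: switch to RUNNING once all restored buffers are queued, not processed. */
        void restore() {
            while (!pending.isEmpty()) {
                if (channel.tryAdd(pending.peekFirst())) {
                    pending.removeFirst();
                } else {
                    // Not enough buffers: process some data so that a buffer is released.
                    channel.processOne();
                }
            }
            state = State.RUNNING;
        }

        /** A checkpoint is rejected while the task is still INITIALIZING. */
        List<String> triggerCheckpoint() {
            if (state != State.RUNNING) {
                throw new IllegalStateException("cannot checkpoint while " + state);
            }
            return channel.snapshot();
        }
    }

    public static void main(String[] args) {
        ToyTask task = new ToyTask(List.of("b1", "b2", "b3"), 2);
        task.restore();
        System.out.println(task.state + " " + task.triggerCheckpoint());
    }
}
```

With three restored buffers and an assumed capacity of two, the toy task has to process one buffer before the last one fits, which mirrors the "not enough network buffers" case in the description. After that it is RUNNING, and a checkpoint captures the buffers that are queued but not yet processed.
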
--
This message was sent by Atlassian Jira
(v8.20.10#820010)