Rui Fan created FLINK-35761:
-------------------------------
Summary: Speed up the restore process of unaligned checkpoint
Key: FLINK-35761
URL: https://issues.apache.org/jira/browse/FLINK-35761
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.19.1, 1.20.0
Reporter: Rui Fan
Assignee: Rui Fan
Currently, a task transitions from ExecutionState.INITIALIZING to
ExecutionState.RUNNING only after all recovered input buffers have been processed.
This makes the restore very slow when the job's processing performance is weak
and the unaligned checkpoint snapshotted a large number of input buffers. In my
experience, the restore time can exceed 30 minutes for jobs with high
parallelism.
We want the job to switch to RUNNING as soon as possible, because a new
checkpoint cannot be triggered during INITIALIZING. Once the job has switched to
RUNNING, a new unaligned checkpoint can be taken.
h2. Brief Solution:
# The task is switched to RUNNING as soon as all input buffers have been added to
RecoveredInputChannel.
** In general, this is quick unless there are not enough network buffers.
# RecoveredInputChannel supports snapshotting its network buffers.
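
The two steps above can be sketched with a toy Java model. This is not Flink code: RestoreSketch, RecoveredChannel, Task and every method name are hypothetical and exist only for illustration. Only the INITIALIZING/RUNNING states and the idea of a recovered input channel holding restored buffers come from this issue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the proposed restore flow (hypothetical, not Flink code). */
public class RestoreSketch {

    enum State { INITIALIZING, RUNNING }

    /** Hypothetical stand-in for RecoveredInputChannel. */
    static final class RecoveredChannel {
        private final Queue<byte[]> buffers = new ArrayDeque<>();

        void add(byte[] buffer) { buffers.add(buffer); }

        byte[] poll() { return buffers.poll(); }

        /** Step 2: copy the not-yet-processed buffers so a checkpoint can persist them. */
        List<byte[]> snapshot() { return new ArrayList<>(buffers); }
    }

    static final class Task {
        State state = State.INITIALIZING;
        final RecoveredChannel channel = new RecoveredChannel();
        int processed = 0;

        /** Step 1: switch to RUNNING once buffers are enqueued, not once they are processed. */
        void restore(List<byte[]> recovered) {
            for (byte[] buffer : recovered) {
                channel.add(buffer);
            }
            state = State.RUNNING;
        }

        /** Processes one recovered buffer; returns false when none are left. */
        boolean processOne() {
            if (channel.poll() == null) {
                return false;
            }
            processed++;
            return true;
        }

        /** In this model a checkpoint is only allowed while RUNNING. */
        List<byte[]> triggerCheckpoint() {
            if (state != State.RUNNING) {
                throw new IllegalStateException("checkpoint requires RUNNING, was " + state);
            }
            return channel.snapshot();
        }
    }

    public static void main(String[] args) {
        Task task = new Task();
        task.restore(List.of(new byte[] {1}, new byte[] {2}, new byte[] {3}));
        task.processOne(); // only one of three recovered buffers has been processed
        System.out.println(task.state + ", buffers in checkpoint: "
                + task.triggerCheckpoint().size());
        // prints "RUNNING, buffers in checkpoint: 2"
    }
}
```

In this model a checkpoint triggered in the middle of recovery still captures the unprocessed recovered buffers, which is why step 2 is needed once step 1 lets the task reach RUNNING early.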