[jira] [Updated] (FLINK-35761) Speed up the restore process of unaligned checkpoint

Rui Fan (Jira) Fri, 05 Jul 2024 01:11:07 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rui Fan updated FLINK-35761:
----------------------------
    Description: 
Currently, when a job restores from an unaligned checkpoint, a task transitions 
from ExecutionState.INITIALIZING to ExecutionState.RUNNING only after all 
recovered input buffers have been processed.

As a result, the restore takes very long if the job processes data slowly and 
the unaligned checkpoint snapshotted many input buffers. In my experience, the 
restore time can exceed 30 minutes for jobs with high parallelism.

We want the job to switch to RUNNING as soon as possible, because a new 
checkpoint cannot be triggered while tasks are INITIALIZING. Once the job is 
RUNNING, a new unaligned checkpoint can be taken.
h2. Solution:

In brief:
 # The task switches to RUNNING once all input buffers have been added to 
RecoveredInputChannel.
 ** In general, this is quick unless there are not enough network buffers.
 ** When there are not enough network buffers, the task still has to wait for 
some buffers to be released. (Buffers are released after part of the data has 
been processed.)
 # RecoveredInputChannel supports snapshotting its network buffers.
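
The difference between the current behaviour and the proposal can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Flink code: the class RestoreSketch, the method processedBeforeRunning, and the single shared buffer pool are hypothetical names and simplifications introduced for illustration. It counts how many recovered buffers must be processed before the task can leave INITIALIZING.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the restore phase. Hypothetical names, not Flink classes. */
class RestoreSketch {
    enum State { INITIALIZING, RUNNING }

    /**
     * Each step either loads one recovered buffer into the channel (if a pool
     * buffer is free) or processes one queued buffer, which frees its pool buffer.
     *
     * @param recovered        number of input buffers stored in the checkpoint
     * @param poolSize         number of network buffers available to the channel
     * @param switchWhenLoaded true = proposed behaviour, false = current behaviour
     * @return buffers processed before the task reaches RUNNING
     */
    static int processedBeforeRunning(int recovered, int poolSize, boolean switchWhenLoaded) {
        if (recovered < 0 || poolSize <= 0) {
            throw new IllegalArgumentException("recovered must be >= 0 and poolSize > 0");
        }
        Deque<Integer> channel = new ArrayDeque<>();
        int loaded = 0;
        int processed = 0;
        State state = State.INITIALIZING;
        while (state == State.INITIALIZING) {
            if (loaded < recovered && channel.size() < poolSize) {
                channel.add(loaded++);   // a free buffer exists: load the next recovered buffer
            } else if (!channel.isEmpty()) {
                channel.poll();          // no free buffer: process one to release it
                processed++;
            }
            boolean allLoaded = loaded == recovered;
            boolean allProcessed = allLoaded && channel.isEmpty();
            if (switchWhenLoaded ? allLoaded : allProcessed) {
                state = State.RUNNING;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("current, enough buffers:  " + processedBeforeRunning(10, 10, false));
        System.out.println("proposed, enough buffers: " + processedBeforeRunning(10, 10, true));
        System.out.println("proposed, too few buffers: " + processedBeforeRunning(10, 4, true));
    }
}
```

In this model the proposed behaviour reaches RUNNING without processing anything when the pool can hold all recovered buffers, and otherwise only has to process enough data to free the missing buffers, while the current behaviour always processes everything first.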

 

Additional issue:
 * RecoveredInputChannel requests only exclusive buffers and does not request 
floating buffers. This leaves RecoveredInputChannel short of buffers if the old 
job (the job that created this checkpoint) was using floating buffers.
 * We could let RecoveredInputChannel request floating buffers in a separate 
Jira issue if this optimization makes sense.
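
To make the shortage concrete, here is a rough arithmetic sketch. It assumes a simplified model in which each recovered channel can hold only its exclusive buffers at once, so any snapshotted buffers beyond that number have to wait for earlier ones to be processed. The class BufferShortfallSketch and its method are hypothetical and not part of Flink, and the per-channel counts are made-up example values.

```java
/** Toy arithmetic. Hypothetical names, not Flink classes. */
class BufferShortfallSketch {
    /**
     * @param snapshottedPerChannel buffers stored in the checkpoint for each input channel
     * @param exclusivePerChannel   exclusive buffers each recovered channel may request
     * @return total number of buffers that cannot be loaded immediately
     */
    static int buffersThatMustWait(int[] snapshottedPerChannel, int exclusivePerChannel) {
        int waiting = 0;
        for (int snapshotted : snapshottedPerChannel) {
            // Buffers beyond the exclusive quota came from floating buffers in the old job.
            waiting += Math.max(0, snapshotted - exclusivePerChannel);
        }
        return waiting;
    }

    public static void main(String[] args) {
        // Example values only: three channels, two exclusive buffers per channel.
        System.out.println(buffersThatMustWait(new int[] {2, 5, 1}, 2));
    }
}
```

Under this model, a channel that held 5 buffers in the old job but gets only 2 exclusive buffers on restore leaves 3 buffers waiting, which is the case the first bullet describes.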

 

  was:
Currently, when a job restores from an unaligned checkpoint, a task transitions 
from ExecutionState.INITIALIZING to ExecutionState.RUNNING only after all 
recovered input buffers have been processed.

As a result, the restore takes very long if the job processes data slowly and 
the unaligned checkpoint snapshotted many input buffers. In my experience, the 
restore time can exceed 30 minutes for jobs with high parallelism.

We want the job to switch to RUNNING as soon as possible, because a new 
checkpoint cannot be triggered while tasks are INITIALIZING. Once the job is 
RUNNING, a new unaligned checkpoint can be taken.
h2. Solution:

In brief:
 # The task switches to RUNNING once all input buffers have been added to 
RecoveredInputChannel.
 ** In general, this is quick unless there are not enough network buffers.
 ** When there are not enough network buffers, the task still has to wait for 
some buffers to be released. (Buffers are released after part of the data has 
been processed.)
 ** RecoveredInputChannel requests only exclusive buffers and does not request 
floating buffers. This leaves RecoveredInputChannel short of buffers if the old 
job (the job that created this checkpoint) was using floating buffers.
 # RecoveredInputChannel supports snapshotting its network buffers.

 


> Speed up the restore process of unaligned checkpoint
> ----------------------------------------------------
>
>                 Key: FLINK-35761
>                 URL: https://issues.apache.org/jira/browse/FLINK-35761
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> Currently, when a job restores from an unaligned checkpoint, a task 
> transitions from ExecutionState.INITIALIZING to ExecutionState.RUNNING only 
> after all recovered input buffers have been processed.
> As a result, the restore takes very long if the job processes data slowly 
> and the unaligned checkpoint snapshotted many input buffers. In my 
> experience, the restore time can exceed 30 minutes for jobs with high 
> parallelism.
> We want the job to switch to RUNNING as soon as possible, because a new 
> checkpoint cannot be triggered while tasks are INITIALIZING. Once the job is 
> RUNNING, a new unaligned checkpoint can be taken.
> h2. Solution:
> In brief:
>  # The task switches to RUNNING once all input buffers have been added to 
> RecoveredInputChannel.
>  ** In general, this is quick unless there are not enough network buffers.
>  ** When there are not enough network buffers, the task still has to wait 
> for some buffers to be released. (Buffers are released after part of the 
> data has been processed.)
>  # RecoveredInputChannel supports snapshotting its network buffers.
>  
> Additional issue:
>  * RecoveredInputChannel requests only exclusive buffers and does not 
> request floating buffers. This leaves RecoveredInputChannel short of buffers 
> if the old job (the job that created this checkpoint) was using floating 
> buffers.
>  * We could let RecoveredInputChannel request floating buffers in a separate 
> Jira issue if this optimization makes sense.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
