[jira] [Updated] (FLINK-24086) Do not re-register SharedStateRegistry to reduce the recovery time of the job

ming li (Jira) Tue, 31 Aug 2021 07:19:04 -0700


     [ https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ming li updated FLINK-24086:
----------------------------
    Description: 
At present, we only recover the {{CompletedCheckpointStore}} when the 
{{JobManager}} starts, so it seems that we do not need to re-register the 
{{SharedStateRegistry}} when the task restarts.

This came up in our production environment, where we discard part of the data 
and state so that only the failed task is restarted. We found that registering 
the {{SharedStateRegistry}} can take several seconds (thousands of tasks and 
dozens of TB of state). When a large number of tasks fail at the same time, 
this can add up to several minutes (number of tasks * several seconds).

Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
recovery can be reduced.

  was:
At present, we only recover the {{CompletedCheckpointStore}} when the 
{{JobManager}} starts, so it seems that we do not need to re-register the 
{{SharedStateRegistry}} when the task restarts.


This came up in our production environment, where we discard part of the data 
and state so that only the failed task is restarted. We found that registering 
the {{SharedStateRegistry}} can take several seconds (thousands of tasks and 
dozens of TB of state). When a large number of tasks fail at the same time, 
this can add up to several minutes (number of tasks * several seconds).

 

Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
recovery can be reduced.


> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-24086
>                 URL: https://issues.apache.org/jira/browse/FLINK-24086
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: ming li
>            Priority: Major
>
> At present, we only recover the {{CompletedCheckpointStore}} when the 
> {{JobManager}} starts, so it seems that we do not need to re-register the 
> {{SharedStateRegistry}} when the task restarts.
> This came up in our production environment, where we discard part of the 
> data and state so that only the failed task is restarted. We found that 
> registering the {{SharedStateRegistry}} can take several seconds (thousands 
> of tasks and dozens of TB of state). When a large number of tasks fail at 
> the same time, this can add up to several minutes (number of tasks * 
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
> recovery can be reduced.
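
The cost argument in the description (number of tasks * several seconds) can 
be illustrated with a toy model. This is only a sketch and not Flink code: 
ToySharedStateRegistry, rebuild_registry, the handle names and the restart 
count are all hypothetical, and the model assumes the set of completed 
checkpoints does not change between task restarts, which is the premise the 
description itself states only tentatively.

```python
from collections import Counter


class ToySharedStateRegistry:
    """Hypothetical stand-in for a shared state registry (not the Flink class).

    It keeps a reference count per shared state handle and counts how many
    registration calls were made, as a proxy for registration time.
    """

    def __init__(self):
        self.ref_counts = Counter()
        self.registrations = 0

    def register(self, handle_keys):
        for key in handle_keys:
            self.ref_counts[key] += 1
            self.registrations += 1


def rebuild_registry(checkpoints):
    """Build a fresh registry by registering every handle of every checkpoint."""
    registry = ToySharedStateRegistry()
    for handle_keys in checkpoints:
        registry.register(handle_keys)
    return registry


# Two retained checkpoints that share one handle (made-up names).
checkpoints = [["sst-1", "sst-2"], ["sst-2", "sst-3"]]
task_restarts = 3  # assumed number of individual task failures

# Strategy A: re-register everything on each task restart.
work_rebuild = 0
for _ in range(task_restarts):
    work_rebuild += rebuild_registry(checkpoints).registrations

# Strategy B: register once and reuse the registry across restarts.
shared = rebuild_registry(checkpoints)
work_reuse = shared.registrations

print(work_rebuild, work_reuse)
```

In this model the work for strategy A grows linearly with the number of task 
restarts, while strategy B pays the registration cost once. Whether reuse is 
actually safe in Flink depends on details of checkpoint recovery that the toy 
model does not capture.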



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
