[jira] [Commented] (FLINK-24086) Do not re-register SharedStateRegistry to reduce the recovery time of the job

Dawid Wysakowicz (Jira) Fri, 10 Sep 2021 03:13:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413089#comment-17413089
 ]


Dawid Wysakowicz commented on FLINK-24086:
------------------------------------------

Hey [~Ming Li], a little disclaimer first: I am quite new to this component, so I 
might not have grasped everything.

1. I think the assumption is right that we can bind the lifecycle of a 
{{SharedStateRegistry}} to a {{CompletedCheckpointStore}}.
2. However, I don't think we can do that within a {{CheckpointCoordinator}}. If 
I understand FLINK-22483 correctly, it is not as easy as saying we will only 
ever create the {{CompletedCheckpointStore}} once. The decision whether to 
create a new store or reuse the old one is implemented in 
{{CheckpointRecoveryFactory#createRecoveredCompletedCheckpointStore}}, and it 
reuses the old {{CompletedCheckpointStore}} only if 
{{PerJobCheckpointRecoveryFactory}} is used. Even if that's the only case where 
the {{CheckpointCoordinator}} might outlive a failure (I honestly don't know 
whether that is true), I would still not feel safe building around such an 
indirect contract. I'd rather move the recovery of a {{SharedStateRegistry}} 
closer to where the {{CompletedCheckpointStore}} is created/restored. One idea 
I have is to move it to the {{CheckpointRecoveryFactory}}.
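
To make the idea in point 2 a bit more concrete, here is a rough sketch of 
what "recover the registry where the store is created" could look like. All 
class and method names below ({{SketchSharedStateRegistry}}, 
{{SketchCheckpointStore}}, {{SketchRecoveryFactory}}, {{createRecoveredStore}}, 
the {{reuseStore}} flag) are hypothetical stand-ins I made up for illustration. 
They are not Flink's actual API, and the real change would likely look 
different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration only.
// None of these classes or methods are Flink's real API.

/** Reference-counts shared state handles, identified here by plain string keys. */
final class SketchSharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private int registrations = 0; // rough proxy for the cost of (re-)registration

    void register(String stateKey) {
        refCounts.merge(stateKey, 1, Integer::sum);
        registrations++;
    }

    int referenceCount(String stateKey) {
        return refCounts.getOrDefault(stateKey, 0);
    }

    int registrations() {
        return registrations;
    }
}

/** Owns its registry, so the registry lives exactly as long as the store. */
final class SketchCheckpointStore {
    private final SketchSharedStateRegistry registry = new SketchSharedStateRegistry();
    private final List<List<String>> checkpoints = new ArrayList<>();

    SketchCheckpointStore(List<List<String>> recoveredCheckpoints) {
        // Registration happens once, at the point where the store is restored.
        for (List<String> sharedStateKeys : recoveredCheckpoints) {
            addCheckpoint(sharedStateKeys);
        }
    }

    void addCheckpoint(List<String> sharedStateKeys) {
        checkpoints.add(sharedStateKeys);
        for (String key : sharedStateKeys) {
            registry.register(key);
        }
    }

    SketchSharedStateRegistry registry() {
        return registry;
    }
}

/** Decides whether a store (and therefore its registry) is reused or rebuilt. */
final class SketchRecoveryFactory {
    private final boolean reuseStore; // assumption: models "factory reuses the old store"
    private final Map<String, SketchCheckpointStore> storesByJob = new HashMap<>();

    SketchRecoveryFactory(boolean reuseStore) {
        this.reuseStore = reuseStore;
    }

    SketchCheckpointStore createRecoveredStore(String jobId, List<List<String>> persisted) {
        if (reuseStore) {
            // Reused store: the existing registry comes with it, nothing is re-registered.
            return storesByJob.computeIfAbsent(jobId, id -> new SketchCheckpointStore(persisted));
        }
        // Fresh store: a fresh registry is built from the recovered checkpoints.
        return new SketchCheckpointStore(persisted);
    }
}

final class RegistryLifecycleSketch {
    public static void main(String[] args) {
        List<List<String>> persisted = List.of(List.of("a", "b"), List.of("b", "c"));
        SketchRecoveryFactory factory = new SketchRecoveryFactory(true);
        SketchCheckpointStore first = factory.createRecoveredStore("job-1", persisted);
        SketchCheckpointStore second = factory.createRecoveredStore("job-1", persisted);
        System.out.println(first.registry() == second.registry()); // prints "true"
        System.out.println(first.registry().registrations());      // prints "4"
    }
}
```

The point of the sketch is only that the caller (the coordinator, in our case) 
never has to know whether the store was reused. It asks the factory for a 
store and gets a registry that is consistent with it in either case.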

> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-24086
>                 URL: https://issues.apache.org/jira/browse/FLINK-24086
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>            Reporter: ming li
>            Assignee: ming li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> At present, we only recover the {{CompletedCheckpointStore}} when the 
> {{JobManager}} starts, so it seems that we do not need to re-register the 
> {{SharedStateRegistry}} when the task restarts.
> The reason for this issue is that in our production environment, we discard 
> part of the data and state to only restart the failed task, but found that it 
> may take several seconds to register the {{SharedStateRegistry}} (thousands 
> of tasks and dozens of TB states). When there are a large number of task 
> failures at the same time, this may take several minutes (number of tasks * 
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
> recovery can be reduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
