[
https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413089#comment-17413089
]
Dawid Wysakowicz commented on FLINK-24086:
------------------------------------------
Hey [~Ming Li], a little disclaimer first: I am quite new to this component, so
I might not have grasped everything.
1. I think the assumption is right that we can bind the lifecycle of a
{{SharedStateRegistry}} to that of a {{CompletedCheckpointStore}}.
2. However, I don't think we can do that within the {{CheckpointCoordinator}}.
If I understand FLINK-22483 correctly, it is not as simple as saying we will
only ever create the {{CompletedCheckpointStore}} once. Whether a new store is
created or an old one is reused is decided in
{{CheckpointRecoveryFactory#createRecoveredCompletedCheckpointStore}}, and the
old {{CompletedCheckpointStore}} is reused only if
{{PerJobCheckpointRecoveryFactory}} is used. Even if that is the only case in
which the {{CheckpointCoordinator}} might outlive a failure (I honestly don't
know whether that is true), I would still not feel safe building on such an
indirect contract. I'd rather move the recovery of the {{SharedStateRegistry}}
closer to where the {{CompletedCheckpointStore}} is created/restored. One idea
I have is to move it to the {{CheckpointRecoveryFactory}}.
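To make the idea a bit more concrete, here is a rough sketch in plain Java. It
does not use the actual Flink classes or signatures: every {{Sketch*}} type, the
{{registryRebuilds}} counter and the state handle keys are hypothetical
stand-ins. The only point is that the registry is built in the same place
where the store is created or reused, so its lifecycle follows the store
without relying on what the coordinator does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a shared state registry: counts references per key.
class SketchSharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void register(String stateHandleKey) {
        refCounts.merge(stateHandleKey, 1, Integer::sum);
    }

    int size() {
        return refCounts.size();
    }
}

// Hypothetical stand-in for a completed checkpoint store that owns its registry.
class SketchCompletedCheckpointStore {
    private final SketchSharedStateRegistry registry;

    SketchCompletedCheckpointStore(SketchSharedStateRegistry registry) {
        this.registry = registry;
    }

    SketchSharedStateRegistry getSharedStateRegistry() {
        return registry;
    }
}

// Hypothetical factory interface; the method name mirrors the one discussed
// above, but the signature is simplified for this sketch.
interface SketchCheckpointRecoveryFactory {
    SketchCompletedCheckpointStore createRecoveredCompletedCheckpointStore(String jobId);
}

// Reuses the store (and therefore the registry) for the same job, so the
// expensive registration runs once per store rather than once per restart.
class SketchPerJobRecoveryFactory implements SketchCheckpointRecoveryFactory {
    private final Map<String, SketchCompletedCheckpointStore> stores = new HashMap<>();
    int registryRebuilds = 0;

    @Override
    public SketchCompletedCheckpointStore createRecoveredCompletedCheckpointStore(String jobId) {
        return stores.computeIfAbsent(jobId, id -> {
            registryRebuilds++;
            SketchSharedStateRegistry registry = new SketchSharedStateRegistry();
            // Placeholder for registering the shared state of restored checkpoints.
            registry.register(id + "/shared-state-1");
            registry.register(id + "/shared-state-2");
            return new SketchCompletedCheckpointStore(registry);
        });
    }
}

// Always creates a fresh store, so the registry is rebuilt together with it.
class SketchFreshStoreRecoveryFactory implements SketchCheckpointRecoveryFactory {
    int registryRebuilds = 0;

    @Override
    public SketchCompletedCheckpointStore createRecoveredCompletedCheckpointStore(String jobId) {
        registryRebuilds++;
        SketchSharedStateRegistry registry = new SketchSharedStateRegistry();
        registry.register(jobId + "/shared-state-1");
        return new SketchCompletedCheckpointStore(registry);
    }
}

class RegistryLifecycleSketch {
    public static void main(String[] args) {
        SketchPerJobRecoveryFactory perJob = new SketchPerJobRecoveryFactory();
        perJob.createRecoveredCompletedCheckpointStore("job-1");
        perJob.createRecoveredCompletedCheckpointStore("job-1"); // simulated restart
        System.out.println("per-job registry rebuilds: " + perJob.registryRebuilds);

        SketchFreshStoreRecoveryFactory fresh = new SketchFreshStoreRecoveryFactory();
        fresh.createRecoveredCompletedCheckpointStore("job-1");
        fresh.createRecoveredCompletedCheckpointStore("job-1"); // simulated restart
        System.out.println("fresh-store registry rebuilds: " + fresh.registryRebuilds);
    }
}
```

With this shape the coordinator would just ask the store for its registry,
and whether the registry survives a restart is decided by the same factory
that decides whether the store survives.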
> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
> Key: FLINK-24086
> URL: https://issues.apache.org/jira/browse/FLINK-24086
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / Coordination
> Reporter: ming li
> Assignee: ming li
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> At present, we only recover the {{CompletedCheckpointStore}} when the
> {{JobManager}} starts, so it seems that we do not need to re-register the
> {{SharedStateRegistry}} when the task restarts.
> The motivation is that in our production environment we discard part of the
> data and state so that only the failed task is restarted, but we found that
> registering the {{SharedStateRegistry}} may take several seconds (thousands
> of tasks and dozens of TB of state). When a large number of tasks fail at
> the same time, this may add up to several minutes (number of tasks *
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task
> recovery can be reduced.