[
https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413232#comment-17413232
]
ming li commented on FLINK-24086:
---------------------------------
Hi [~dwysakowicz], I have a question about this statement:
{quote}it reuses old {{CompleteCheckpointStore}} only if
{{PerJobCheckpointRecoveryFactory}} is used.
{quote}
From the code, it looks like {{PerJobCheckpointRecoveryFactory}} does not
actually reuse the previous {{CompletedCheckpointStore}}; it creates a new
{{EmbeddedCompletedCheckpointStore}} instead. I am not sure about this, so
please correct me if I am wrong.
It seems that we all now agree that {{SharedState}} should be registered only
after the {{CompletedCheckpointStore}} has been recovered.
I think you are right: moving the registration of {{SharedState}} into
{{CheckpointRecoveryFactory}} is the better choice, because it reduces the
changes needed in {{CompletedCheckpointStore}}.
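To make the idea concrete, here is a minimal sketch of registering shared state once, when the store is recovered, and then reusing the same registry on task restarts. It is not Flink code: {{Registry}}, {{Checkpoint}}, {{Store}}, {{RecoveryFactory}} and {{registryForTaskRestart}} are hypothetical stand-ins I made up for illustration, and the real classes have different signatures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedRegistrySketch {

    /** Hypothetical registry: one reference count per shared state handle key. */
    static final class Registry {
        private final Map<String, Integer> refCounts = new HashMap<>();
        int registrations = 0; // total register() calls, to make the cost visible

        void register(String handleKey) {
            refCounts.merge(handleKey, 1, Integer::sum);
            registrations++;
        }

        int refCount(String handleKey) {
            return refCounts.getOrDefault(handleKey, 0);
        }
    }

    /** Hypothetical completed checkpoint: an id plus the shared handles it references. */
    static final class Checkpoint {
        final long id;
        final List<String> sharedHandles;

        Checkpoint(long id, List<String> sharedHandles) {
            this.id = id;
            this.sharedHandles = sharedHandles;
        }
    }

    /** Hypothetical store that keeps its registry for its whole lifetime. */
    static final class Store {
        final List<Checkpoint> checkpoints;
        final Registry registry;

        Store(List<Checkpoint> checkpoints, Registry registry) {
            this.checkpoints = checkpoints;
            this.registry = registry;
        }
    }

    /** Hypothetical factory: recovers the store and registers shared state once. */
    static final class RecoveryFactory {
        Store recover(List<Checkpoint> persisted) {
            Registry registry = new Registry();
            for (Checkpoint cp : persisted) {
                for (String handle : cp.sharedHandles) {
                    registry.register(handle);
                }
            }
            return new Store(persisted, registry);
        }
    }

    /** A task restart reuses the registry built at recovery; nothing is re-registered. */
    static Registry registryForTaskRestart(Store store) {
        return store.registry;
    }

    public static void main(String[] args) {
        List<Checkpoint> persisted = Arrays.asList(
                new Checkpoint(1L, Arrays.asList("sst-a", "sst-b")),
                new Checkpoint(2L, Arrays.asList("sst-b", "sst-c")));
        Store store = new RecoveryFactory().recover(persisted);

        // Simulate three task restarts; the registration count should not grow.
        for (int restart = 0; restart < 3; restart++) {
            registryForTaskRestart(store);
        }
        System.out.println("register() calls: " + store.registry.registrations);
        System.out.println("refCount(sst-b): " + store.registry.refCount("sst-b"));
    }
}
```

In this sketch the registration cost is paid once per store recovery rather than once per task restart, and {{Store}} itself needs no extra registration logic because the factory does that work.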
> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
> Key: FLINK-24086
> URL: https://issues.apache.org/jira/browse/FLINK-24086
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / Coordination
> Reporter: ming li
> Assignee: ming li
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> At present, we only recover the {{CompletedCheckpointStore}} when the
> {{JobManager}} starts, so it seems that we do not need to re-register the
> {{SharedStateRegistry}} when the task restarts.
> The motivation for this issue is that in our production environment we
> discard part of the data and state so that only the failed task is restarted,
> but we found that registering the {{SharedStateRegistry}} may take several
> seconds (thousands of tasks and dozens of TB of state). When a large number
> of tasks fail at the same time, this may take several minutes (number of
> tasks * several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task
> recovery can be reduced.