[ https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408273#comment-17408273 ]
ming li edited comment on FLINK-24086 at 9/1/21, 4:49 PM:
----------------------------------------------------------
{quote}We implemented a new failover strategy (by discarding some data to only
restart failed tasks)
{quote}
This can be ignored here. You can treat it as if the job does a full restart
and is restored from the checkpoint.
{quote}But now we don't restore CompleteCheckpointStore again, this problem
will no longer exist
{quote}
According to FLINK-22483, we will not recover the
{{CompletedCheckpointStore}} every time. Therefore, if we reuse the same
{{SharedStateRegistry}} during restore and do not clear it, asynchronous
deletion will not cause the reference count of a {{SharedState}} to drop below
1. So reusing the registry can reduce the recovery time.
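
To illustrate the reference-counting argument, here is a minimal sketch. It is a toy model only: {{ToySharedStateRegistry}}, its methods, and the keys {{sst-a}} / {{sst-b}} are hypothetical names made up for this example and are not Flink's actual {{SharedStateRegistry}} API. It assumes that each checkpoint holds one reference per shared state it uses, and that the same registry instance is kept across a task restart.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy reference-counted registry (hypothetical, NOT Flink's SharedStateRegistry). */
class ToySharedStateRegistry {
    // Maps a shared-state key to the number of checkpoints referencing it.
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Adds one reference to the given key and returns the new count. */
    int register(String key) {
        return refCounts.merge(key, 1, Integer::sum);
    }

    /** Drops one reference; returns true if the state is now unreferenced. */
    boolean unregister(String key) {
        Integer count = refCounts.get(key);
        if (count == null) {
            throw new IllegalStateException("unknown key: " + key);
        }
        if (count == 1) {
            refCounts.remove(key);
            return true; // last reference gone, the state could be deleted
        }
        refCounts.put(key, count - 1);
        return false;
    }

    /** Returns the current reference count, or 0 if the key is not registered. */
    int count(String key) {
        return refCounts.getOrDefault(key, 0);
    }
}

class ReuseRegistryDemo {
    public static void main(String[] args) {
        ToySharedStateRegistry registry = new ToySharedStateRegistry();

        // Checkpoints 1 and 2 both reference "sst-a"; only checkpoint 1 uses "sst-b".
        registry.register("sst-a"); // checkpoint 1
        registry.register("sst-b"); // checkpoint 1
        registry.register("sst-a"); // checkpoint 2

        // A task restarts here. The registry is reused as-is: it is neither
        // cleared nor rebuilt, so no per-state registration work is repeated.

        // Checkpoint 1 is later discarded (possibly asynchronously).
        boolean aUnreferenced = registry.unregister("sst-a");
        boolean bUnreferenced = registry.unregister("sst-b");

        System.out.println("sst-a unreferenced: " + aUnreferenced
                + ", remaining count: " + registry.count("sst-a"));
        System.out.println("sst-b unreferenced: " + bUnreferenced
                + ", remaining count: " + registry.count("sst-b"));
    }
}
```

In this toy model the state still used by checkpoint 2 keeps a count of at least 1 after checkpoint 1 is discarded, because the counts were never reset in between. Whether the real registry can be reused this safely depends on details of Flink that the sketch does not model.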
> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
> Key: FLINK-24086
> URL: https://issues.apache.org/jira/browse/FLINK-24086
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: ming li
> Priority: Major
>
> At present, we only recover the {{CompletedCheckpointStore}} when the
> {{JobManager}} starts, so it seems that we do not need to re-register the
> {{SharedStateRegistry}} when the task restarts.
> The motivation for this issue is that in our production environment we discard
> part of the data and state so that only the failed task is restarted, but we
> found that registering the {{SharedStateRegistry}} may take several seconds
> (thousands of tasks and dozens of TB of state). When a large number of tasks
> fail at the same time, this may take several minutes (number of tasks *
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task
> recovery can be reduced.