ming li created FLINK-24086:
-------------------------------
Summary: Do not re-register SharedStateRegistry to reduce the
recovery time of the job
Key: FLINK-24086
URL: https://issues.apache.org/jira/browse/FLINK-24086
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Reporter: ming li
At present, the {{CompletedCheckpointStore}} is recovered only when the
{{JobManager}} starts, so it seems unnecessary to re-register the
{{SharedStateRegistry}} each time a task restarts.
This came up in our production environment, where we discard part of the data
and state so that only the failed task is restarted. We found that registering
the {{SharedStateRegistry}} may take several seconds (thousands of tasks and
dozens of TB of state). When a large number of tasks fail at the same time,
this may add up to several minutes (number of failed tasks * several seconds).
Therefore, if the {{SharedStateRegistry}} can be reused, the time for task
recovery can be reduced.
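
To make the cost argument concrete, here is a minimal sketch of the idea. It is
a toy model only: {{ToyRegistry}}, {{rebuild}} and {{totalRegistrations}} are
hypothetical names and are not Flink's actual {{SharedStateRegistry}} API. It
assumes the set of retained checkpoints does not change between task restarts,
and it counts registration calls as a stand-in for recovery time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; NOT Flink's real SharedStateRegistry.
class RegistryReuseSketch {

    /** Toy registry: maps a shared state key to its reference count. */
    static final class ToyRegistry {
        private final Map<String, Integer> refCounts = new HashMap<>();
        private int registrations = 0; // number of register() calls so far

        void register(String stateKey) {
            refCounts.merge(stateKey, 1, Integer::sum);
            registrations++;
        }

        int registrations() { return registrations; }

        int distinctKeys() { return refCounts.size(); }
    }

    /** Builds a fresh registry by registering every key of every retained checkpoint. */
    static ToyRegistry rebuild(List<List<String>> retainedCheckpoints) {
        ToyRegistry registry = new ToyRegistry();
        for (List<String> checkpoint : retainedCheckpoints) {
            for (String key : checkpoint) {
                registry.register(key);
            }
        }
        return registry;
    }

    /**
     * Counts register() calls for one initial build (JobManager start) plus
     * the given number of task restarts, with or without reusing the registry.
     */
    static int totalRegistrations(List<List<String>> retained, int restarts, boolean reuse) {
        ToyRegistry registry = rebuild(retained); // built once at startup
        int total = registry.registrations();
        for (int i = 0; i < restarts; i++) {
            if (!reuse) {
                registry = rebuild(retained); // re-register on every restart
                total += registry.registrations();
            }
            // With reuse, the existing registry is kept and no work is done.
        }
        return total;
    }

    public static void main(String[] args) {
        // Two retained checkpoints that share some state keys (made-up data).
        List<List<String>> retained = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "c", "d"));

        System.out.println("re-register: " + totalRegistrations(retained, 3, false));
        System.out.println("reuse:       " + totalRegistrations(retained, 3, true));
    }
}
```

In this model the work without reuse grows linearly with the number of
restarts, while with reuse it stays at the cost of the single initial build.
Whether the real registry can be kept safely across restarts depends on
details this sketch does not model, such as checkpoints being subsumed in the
meantime.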
--
This message was sent by Atlassian Jira
(v8.3.4#803005)