[ 
https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410758#comment-17410758
 ] 

Piotr Nowojski commented on FLINK-24086:
----------------------------------------

I had to think about this for a while, but I think you are right. There is no 
reason to recreate {{SharedStateRegistry}} on every job restore. The contract 
should be that {{SharedStateRegistry}} does not outlive the 
{{CompletedCheckpointStore}}.
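
To make that lifetime contract concrete, here is a minimal sketch. The class 
and method names ({{SketchSharedStateRegistry}}, {{SketchCheckpointStore}}, 
{{recoverFromStorage}}, {{restoreJob}}) are hypothetical stand-ins and are not 
Flink's actual classes or signatures. The only assumption is that the registry 
is owned by the checkpoint store and handed out again on each restore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a shared state registry: it only counts
// references per shared state handle key. Not Flink's real API.
class SketchSharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void register(String stateHandleKey) {
        refCounts.merge(stateHandleKey, 1, Integer::sum);
    }

    int distinctHandles() {
        return refCounts.size();
    }
}

// Hypothetical stand-in for a completed checkpoint store that owns the
// registry, so the registry cannot outlive the store.
class SketchCheckpointStore {
    // Created once, together with the store.
    private final SketchSharedStateRegistry registry = new SketchSharedStateRegistry();
    private int registrationPasses = 0;

    // Assumed to run once, when the JobManager starts and recovers checkpoints.
    void recoverFromStorage(Iterable<String> sharedHandleKeys) {
        for (String key : sharedHandleKeys) {
            registry.register(key);
        }
        registrationPasses++;
    }

    // Assumed to run on every job or task restore: the existing registry is
    // reused, so no handles are registered again.
    SketchSharedStateRegistry restoreJob() {
        return registry;
    }

    int registrationPasses() {
        return registrationPasses;
    }
}
```

In this sketch, calling {{restoreJob()}} any number of times returns the same 
registry instance, and the registration loop runs only in 
{{recoverFromStorage}}.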

[~Ming Li], would you like to work on this issue? I'm still not 100% sure that 
there are no other issues, though. I'm also not sure how big this change would 
be. If it's a large change, I think we would need a design doc. If it's a 
simple change, maybe a discussion over a draft PR would be easier?

> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-24086
>                 URL: https://issues.apache.org/jira/browse/FLINK-24086
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: ming li
>            Priority: Major
>
> At present, we only recover the {{CompletedCheckpointStore}} when the 
> {{JobManager}} starts, so it seems that we do not need to re-register the 
> {{SharedStateRegistry}} when the task restarts.
> The reason for this issue is that in our production environment, we discard 
> part of the data and state so that only the failed task is restarted, but we 
> found that it may take several seconds to register the 
> {{SharedStateRegistry}} (thousands of tasks and dozens of TB of state). When 
> a large number of tasks fail at the same time, this may take several minutes 
> (number of tasks * several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task 
> recovery can be reduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)