Robert Metzger created FLINK-22483:
--------------------------------------
Summary: Recover checkpoints when JobMaster gains leadership
Key: FLINK-22483
URL: https://issues.apache.org/jira/browse/FLINK-22483
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
Fix For: 1.14.0
Recovering checkpoints (from the CompletedCheckpointStore) is a potentially
blocking operation, for example if the file system implementation is retrying
to connect to an unavailable storage backend.
Currently, we call the CompletedCheckpointStore.recover() method from the main
thread of the JobManager, which makes it unresponsive to any RPC call for as
long as recover() is blocked.
By moving the recovery to the start of the JobManager (which happens
asynchronously after the JobMaster has gained leadership), Flink will remain
responsive (reporting a job in INITIALIZING state).
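The idea can be illustrated with a minimal, self-contained sketch. It is not
Flink code: CheckpointStore, JobHandle, JobStatus and requestJobStatus() are
hypothetical stand-ins invented for this example, and the real fix may be
structured differently. The sketch only shows the general pattern of running a
potentially blocking recover() call on a separate executor, so that status
queries are answered with INITIALIZING in the meantime.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncRecoverySketch {

    /** Hypothetical stand-in for a checkpoint store; NOT Flink's interface. */
    interface CheckpointStore {
        List<String> recover() throws Exception; // may block on slow storage
    }

    enum JobStatus { INITIALIZING, RUNNING, FAILED }

    /** Hypothetical job handle whose status query must never block. */
    static final class JobHandle {
        private volatile JobStatus status = JobStatus.INITIALIZING;
        private final CompletableFuture<List<String>> recovered;

        JobHandle(CheckpointStore store, ExecutorService ioExecutor) {
            // Run the blocking recovery on the I/O executor, not on the caller's thread.
            this.recovered = CompletableFuture
                    .supplyAsync(() -> {
                        try {
                            return store.recover();
                        } catch (Exception e) {
                            throw new CompletionException(e);
                        }
                    }, ioExecutor)
                    // The returned future completes only after the status was updated.
                    .whenComplete((checkpoints, error) ->
                            status = (error == null) ? JobStatus.RUNNING : JobStatus.FAILED);
        }

        /** Returns immediately, even while recovery is still blocked. */
        JobStatus requestJobStatus() {
            return status;
        }

        CompletableFuture<List<String>> recoveredCheckpoints() {
            return recovered;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        // The latch simulates a storage backend that is unavailable for a while.
        CountDownLatch storageAvailable = new CountDownLatch(1);
        CheckpointStore slowStore = () -> {
            storageAvailable.await();
            return Arrays.asList("chk-1", "chk-2");
        };

        JobHandle job = new JobHandle(slowStore, ioExecutor);
        System.out.println("while blocked: " + job.requestJobStatus());

        storageAvailable.countDown(); // storage comes back
        List<String> checkpoints = job.recoveredCheckpoints().get(5, TimeUnit.SECONDS);
        System.out.println("recovered: " + checkpoints);
        System.out.println("afterwards: " + job.requestJobStatus());
        ioExecutor.shutdown();
    }
}
```

In this sketch the status query answers while recover() is still waiting on
the simulated storage, which mirrors the "job in INITIALIZING state" behavior
described above. A failure during recovery is surfaced through the future and
reflected as FAILED instead of hanging the caller.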