Jiayi Liao created FLINK-19596:
----------------------------------

             Summary: Do not recover CompletedCheckpointStore on each failover
                 Key: FLINK-19596
                 URL: https://issues.apache.org/jira/browse/FLINK-19596
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.11.2
            Reporter: Jiayi Liao

{{completedCheckpointStore.recover()}} in {{restoreLatestCheckpointedStateInternal}} can become a bottleneck on failover, because the {{CompletedCheckpointStore}} has to load files from HDFS to instantiate the {{CompletedCheckpoint}} instances.

The impact is significant in our case:
* Jobs with high parallelism (no shuffle) which transfer data from Kafka to other filesystems.
* If a machine goes down, several containers and tens of tasks are affected, so {{completedCheckpointStore.recover()}} is called tens of times, since the affected tasks do not share a failover region.
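
The idea in the summary could be sketched roughly as follows. This is only an illustration, not Flink's actual code: {{RecoverOnceStore}}, {{recoverIfNeeded()}}, {{invalidate()}} and {{latestCheckpoint()}} are hypothetical names, and the {{Supplier}} stands in for the expensive HDFS read. It assumes the in-memory view of completed checkpoints stays valid while the same process keeps running, so the remote read is only needed once (or again after an explicit invalidation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in for a completed-checkpoint store.
 * It is NOT Flink's CompletedCheckpointStore API; it only illustrates
 * skipping repeated recovery when the in-memory state is already loaded.
 */
final class RecoverOnceStore {
    // Simulates the expensive remote read of checkpoint metadata (e.g. from HDFS).
    private final Supplier<List<String>> remoteLoader;
    private List<String> checkpoints = new ArrayList<>();
    private boolean recovered = false;

    RecoverOnceStore(Supplier<List<String>> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    /** Loads from remote storage only if nothing has been recovered yet. */
    synchronized void recoverIfNeeded() {
        if (recovered) {
            return; // a task failover in the same process reuses the in-memory state
        }
        checkpoints = new ArrayList<>(remoteLoader.get());
        recovered = true;
    }

    /** Forces the next recoverIfNeeded() call to read from remote storage again. */
    synchronized void invalidate() {
        recovered = false;
    }

    /** Returns the most recently listed checkpoint, or null if there is none. */
    synchronized String latestCheckpoint() {
        return checkpoints.isEmpty() ? null : checkpoints.get(checkpoints.size() - 1);
    }
}
```

With such a guard, tens of task failovers caused by one lost machine would trigger a single remote load instead of one per failover. When it is actually safe to skip the reload (for example after a change of leadership) is a design question this sketch does not answer.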