Jiayi Liao created FLINK-19596:
----------------------------------

             Summary: Do not recover CompletedCheckpointStore on each failover
                 Key: FLINK-19596
                 URL: https://issues.apache.org/jira/browse/FLINK-19596
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.11.2
            Reporter: Jiayi Liao


{{completedCheckpointStore.recover()}} in 
{{restoreLatestCheckpointedStateInternal}} can become a bottleneck on failover 
because the {{CompletedCheckpointStore}} needs to load files from HDFS to 
instantiate the {{CompletedCheckpoint}} instances.

The impact is significant in the following scenario from our setup:

* Jobs with high parallelism and no shuffle that transfer data from Kafka to 
other filesystems.
* If a machine goes down, several containers and tens of tasks are affected, 
so {{completedCheckpointStore.recover()}} is called tens of times because the 
affected tasks do not share a failover region.
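
The idea in the summary can be illustrated with a minimal sketch. This is not Flink code: {{CheckpointStore}}, {{CountingStore}} and {{RecoverOnceStore}} are hypothetical stand-ins, and the checkpoint names are made up. It only shows that wrapping an expensive {{recover()}} so that its result is reused turns tens of remote loads into one. A real change would also have to decide when the cached state must be refreshed, for example after a JobManager leader change.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RecoverOnceSketch {

    /** Hypothetical stand-in for the store; not Flink's actual interface. */
    interface CheckpointStore {
        List<String> recover();
    }

    /** Simulates a store whose recover() is expensive, and counts its invocations. */
    static final class CountingStore implements CheckpointStore {
        int recoverCalls = 0;

        @Override
        public List<String> recover() {
            recoverCalls++; // in the real system this is where files would be read from HDFS
            List<String> checkpoints = new ArrayList<>();
            checkpoints.add("chk-41"); // made-up checkpoint names
            checkpoints.add("chk-42");
            return checkpoints;
        }
    }

    /** Delegates to the underlying store on first use and reuses the result afterwards. */
    static final class RecoverOnceStore implements CheckpointStore {
        private final CheckpointStore delegate;
        private List<String> cached; // null until the first recovery

        RecoverOnceStore(CheckpointStore delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized List<String> recover() {
            if (cached == null) {
                cached = Collections.unmodifiableList(delegate.recover());
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        CountingStore backing = new CountingStore();
        CheckpointStore store = new RecoverOnceStore(backing);

        // Simulate 30 task failovers, each of which triggers a restore.
        for (int failover = 0; failover < 30; failover++) {
            store.recover();
        }

        // prints "failovers=30, remote loads=1"
        System.out.println("failovers=30, remote loads=" + backing.recoverCalls);
    }
}
```

Without the wrapper, the same loop would call the backing store 30 times, which corresponds to the repeated HDFS loads described above.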



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
