Zakelly Lan created FLINK-32070: ----------------------------------- Summary: FLIP-306 Unified File Merging Mechanism for Checkpoints Key: FLINK-32070 URL: https://issues.apache.org/jira/browse/FLINK-32070 Project: Flink Issue Type: New Feature Reporter: Zakelly Lan Assignee: Zakelly Lan Fix For: 1.18.0
The FLIP: [https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints] The creation of multiple checkpoint files can lead to a 'file flood' problem, in which a large number of files are written to the checkpoint storage in a short amount of time. This can cause issues in large clusters with high workloads, such as the creation and deletion of many files increasing the amount of file meta modification on DFS, leading to single-machine hotspot issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the performance of object storage (e.g. Amazon S3 and Alibaba OSS) can significantly decrease when listing objects, which is necessary for object name de-duplication before creating an object, further affecting the performance of directory manipulation in the file system's perspective of view (See [hadoop-aws module documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients], section 'Warning #2: Directories are mimicked'). While many solutions have been proposed for individual types of state files (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel state), the file flood problems from each type of checkpoint file are similar and lack systematic view and solution. Therefore, the goal of this FLIP is to establish a unified file merging mechanism to address the file flood problem during checkpoint creation for all types of state files, including keyed, non-keyed, channel, and changelog state. This will significantly improve the system stability and availability of fault tolerance in Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)