Zakelly Lan created FLINK-32070:
-----------------------------------
Summary: FLIP-306 Unified File Merging Mechanism for Checkpoints
Key: FLINK-32070
URL: https://issues.apache.org/jira/browse/FLINK-32070
Project: Flink
Issue Type: New Feature
Reporter: Zakelly Lan
Assignee: Zakelly Lan
Fix For: 1.18.0
The FLIP:
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints]
The creation of multiple checkpoint files can lead to a 'file flood' problem,
in which a large number of files are written to the checkpoint storage in a
short amount of time. This can cause issues in large clusters with high
workloads, such as the creation and deletion of many files increasing the
amount of file meta modification on DFS, leading to single-machine hotspot
issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the
performance of object storage (e.g. Amazon S3 and Alibaba OSS) can
significantly decrease when listing objects, which is necessary for object name
de-duplication before creating an object, further affecting the performance of
directory manipulation in the file system's perspective of view (See
[hadoop-aws module
documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients],
section 'Warning #2: Directories are mimicked').
While many solutions have been proposed for individual types of state files
(e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel state),
the file flood problems from each type of checkpoint file are similar and lack
systematic view and solution. Therefore, the goal of this FLIP is to establish
a unified file merging mechanism to address the file flood problem during
checkpoint creation for all types of state files, including keyed, non-keyed,
channel, and changelog state. This will significantly improve the system
stability and availability of fault tolerance in Flink.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)