Zakelly Lan created FLINK-32070:
-----------------------------------

             Summary: FLIP-306 Unified File Merging Mechanism for Checkpoints
                 Key: FLINK-32070
                 URL: https://issues.apache.org/jira/browse/FLINK-32070
             Project: Flink
          Issue Type: New Feature
            Reporter: Zakelly Lan
            Assignee: Zakelly Lan
             Fix For: 1.18.0


The FLIP: 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints]

 

The creation of multiple checkpoint files can lead to a 'file flood' problem, 
in which a large number of files are written to the checkpoint storage in a 
short amount of time. This can cause issues in large clusters with high 
workloads, such as the creation and deletion of many files increasing the 
amount of file meta modification on DFS, leading to single-machine hotspot 
issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the 
performance of object storage (e.g. Amazon S3 and Alibaba OSS) can 
significantly decrease when listing objects, which is necessary for object name 
de-duplication before creating an object, further affecting the performance of 
directory manipulation in the file system's perspective of view (See 
[hadoop-aws module 
documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients],
 section 'Warning #2: Directories are mimicked').

While many solutions have been proposed for individual types of state files 
(e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel state), 
the file flood problems from each type of checkpoint file are similar and lack 
systematic view and solution. Therefore, the goal of this FLIP is to establish 
a unified file merging mechanism to address the file flood problem during 
checkpoint creation for all types of state files, including keyed, non-keyed, 
channel, and changelog state. This will significantly improve the system 
stability and availability of fault tolerance in Flink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to