Congxian Qiu(klion26) created FLINK-11937:
---------------------------------------------
Summary: Resolve small file problem in RocksDB incremental
checkpoint
Key: FLINK-11937
URL: https://issues.apache.org/jira/browse/FLINK-11937
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing
Affects Versions: 1.7.2
Reporter: Congxian Qiu(klion26)
Assignee: Congxian Qiu(klion26)
Currently, when incremental checkpointing is enabled in RocksDBStateBackend, a
separate file is generated on DFS for each sst file. This may cause a “file
flood” when running intensive workloads (many jobs with high parallelism) in a
big cluster. According to our observation in Alibaba production, such a file
flood introduces at least two drawbacks when using HDFS as the checkpoint
storage FileSystem: 1) a huge number of RPC requests is issued to the NameNode
(NN), which may burst its response queue; 2) the huge number of files puts big
pressure on the NN’s on-heap memory.
In Flink we previously noticed a similar small-file flood problem and tried to
resolve it by introducing ByteStreamStateHandle (FLINK-2818), but this solution
has a limitation: if we configure the threshold too low, there will still be
too many small files, while if it is too high, the JM will eventually OOM. It
therefore can hardly resolve the issue when using RocksDBStateBackend with the
incremental snapshot strategy.
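For context, the FLINK-2818 threshold mentioned above is exposed via the `state.backend.fs.memory-threshold` configuration option: state smaller than the threshold is inlined into the checkpoint metadata on the JM (as a ByteStreamStateHandle) instead of being written as a separate DFS file. A sketch of the tuning dilemma (the value shown is only the default, not a recommendation):

```yaml
# flink-conf.yaml
# State below this size is stored inline in the checkpoint metadata
# rather than as a separate DFS file.
# Too low: many small files remain; too high: JM heap pressure / OOM.
state.backend.fs.memory-threshold: 1024
```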
We propose a new OutputStream called
FileSegmentCheckpointStateOutputStream (FSCSOS) to fix the problem. FSCSOS
will reuse the same underlying distributed file until its size exceeds a
preset threshold. We plan to complete the work in 3 steps: firstly, introduce
FSCSOS; secondly, resolve the specific storage amplification issue of FSCSOS;
and lastly, add an option to reuse FSCSOS across multiple checkpoints to
further reduce the number of DFS files.
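The core idea of FSCSOS can be sketched as follows. This is an illustrative simplification, not the actual Flink implementation: all class and method names here are hypothetical, and in-memory buffers stand in for DFS files. Multiple logical state blobs share one physical file, each identified by a (fileId, offset, length) segment handle, and a new physical file is opened only when the size threshold would be exceeded.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed FileSegmentCheckpointStateOutputStream:
// many logical state "files" share one underlying physical file, each
// represented by an (offset, length) segment, so one checkpoint produces
// far fewer DFS files than one-file-per-sst.
class FileSegmentStream {
    static final class SegmentHandle {
        final int fileId;   // which physical file the segment lives in
        final long offset;  // start position inside that file
        final long length;  // segment length in bytes
        SegmentHandle(int fileId, long offset, long length) {
            this.fileId = fileId; this.offset = offset; this.length = length;
        }
    }

    private final long threshold;  // max physical file size before rollover
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;

    FileSegmentStream(long threshold) {
        this.threshold = threshold;
        rollover();
    }

    // Open a new underlying "physical file" (an in-memory buffer here).
    private void rollover() {
        current = new ByteArrayOutputStream();
        files.add(current);
    }

    // Write one logical state blob; returns a handle pointing into the
    // shared physical file instead of creating a new DFS file per blob.
    SegmentHandle writeSegment(byte[] data) {
        if (current.size() > 0 && current.size() + data.length > threshold) {
            rollover();  // start a new physical file once the threshold is hit
        }
        long offset = current.size();
        current.write(data, 0, data.length);
        return new SegmentHandle(files.size() - 1, offset, data.length);
    }

    int physicalFileCount() { return files.size(); }
}
```

For example, with a 100-byte threshold, three 40-byte sst files occupy two physical files (segments at offsets 0 and 40 in the first, offset 0 in the second) instead of three separate DFS files.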
For more details, please refer to the attached design doc.
[1] https://www.slideshare.net/dataArtisans/stephan-ewen-experiences-running-flink-at-very-large-scale
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)