[jira] [Commented] (FLINK-26803) Merge small ChannelState file for Unaligned Checkpoint

Anton Kalashnikov (Jira) Thu, 14 Apr 2022 07:13:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522333#comment-17522333
 ]


Anton Kalashnikov commented on FLINK-26803:
-------------------------------------------

Yep, I know that `state.storage.fs.memory-threshold` is just 20KB by default, 
but I meant that you can try increasing it as much as you need and compare the 
result with the current scenario. It obviously adds the overhead of sending the 
state to the JobManager, but it may still work out better for you. (Of course, 
if you have a lot of TaskManagers and high parallelism, it may still not be 
possible to write everything to one file.)
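To make the trade-off concrete, here is a rough sketch of the effect of the threshold: state at or below it travels inline with the checkpoint metadata to the JobManager, and anything larger becomes a separate file. The class and method names are hypothetical (this is not Flink code), the state sizes are made-up illustration values, and the exact boundary handling is an assumption.

```java
// Hypothetical sketch, not Flink's actual classes or API.
final class MemoryThresholdSketch {

    /**
     * Assumption: state whose size does not exceed the threshold is kept
     * inline (sent to the JobManager) instead of being written to a file.
     */
    static boolean storedInline(long stateSizeBytes, long thresholdBytes) {
        return stateSizeBytes <= thresholdBytes;
    }

    /** Counts how many of the given state chunks would end up as separate files. */
    static int countFiles(long[] stateSizesBytes, long thresholdBytes) {
        int files = 0;
        for (long size : stateSizesBytes) {
            if (!storedInline(size, thresholdBytes)) {
                files++;
            }
        }
        return files;
    }
}
```

With illustrative chunks of 4KB, 64KB, 512KB and 2MB, a 20KB threshold leaves three files, while a 1MB threshold leaves one, at the cost of more bytes being shipped to the JobManager.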

I will take a deeper look at your idea later. But keep in mind that right now 
we have an unsynchronized `AsyncCheckpointRunnable` for every subtask, which is 
not a problem because each one writes to a different file. If we want to write 
to one (or several) shared files, we need to reimplement this somehow (either 
with synchronization or with one `AsyncCheckpointRunnable` per file).
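
As a minimal sketch of the synchronization option only: several subtasks append to one shared file, and each remembers the offset and length of its own data so it can be located on restore. Everything here is hypothetical (`SharedStateFileSketch`, `Segment`, and the in-memory buffer standing in for a real file); it is not how `AsyncCheckpointRunnable` works today and not a proposed Flink API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch, not Flink code. An in-memory buffer stands in for the shared file.
final class SharedStateFileSketch {

    /** Position of one subtask's data inside the shared file. */
    static final class Segment {
        final int offset;
        final int length;

        Segment(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final ByteArrayOutputStream file = new ByteArrayOutputStream();

    /** Appends one subtask's data; synchronized so concurrent writers never interleave bytes. */
    synchronized Segment append(byte[] data) {
        int offset = file.size();
        file.write(data, 0, data.length);
        return new Segment(offset, data.length);
    }

    /** Reads back the bytes that belong to the given segment. */
    synchronized byte[] read(Segment segment) {
        byte[] all = file.toByteArray();
        return Arrays.copyOfRange(all, segment.offset, segment.offset + segment.length);
    }

    /** Total number of bytes written so far. */
    synchronized int size() {
        return file.size();
    }
}
```

The lock serializes writers, so this trades some write parallelism for fewer files; the "one `AsyncCheckpointRunnable` per file" option would avoid the lock by giving each file a single writer.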

> Merge small ChannelState file for Unaligned Checkpoint
> ------------------------------------------------------
>
>                 Key: FLINK-26803
>                 URL: https://issues.apache.org/jira/browse/FLINK-26803
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Network
>            Reporter: fanrui
>            Priority: Major
>
> When making an unaligned checkpoint, the number of ChannelState files is 
> TaskNumber * subtaskNumber. For a high-parallelism job, this produces too many 
> small files, which puts a high load on the HDFS NameNode.
>
> In our production, a job writes more than 50K small files for each Unaligned 
> Checkpoint. Could we merge these files before writing them to the FileSystem? 
> We could make the maximum number of files each TM can write in a single 
> Unaligned Checkpoint configurable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
