[
https://issues.apache.org/jira/browse/FLINK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522333#comment-17522333
]
Anton Kalashnikov commented on FLINK-26803:
-------------------------------------------
Yep, I know that `state.storage.fs.memory-threshold` is just 20KB by default,
but I meant that you can try increasing it as much as you need and compare the
result with the current scenario. It obviously adds the overhead of sending
that state to the JobManager, but maybe it will still be better for you. (Of
course, if you have a lot of TaskManagers and high parallelism, it still may
not be possible to write everything to one file.)
I will take a deeper look at your idea later. Just take into account that
right now we have an unsynchronized `AsyncCheckpointRunnable` for every
subtask, which is not a problem since each one writes to a different file. But
if we want to write to one (or several) shared files, we need to reimplement
it somehow (synchronization, or one `AsyncCheckpointRunnable` per file).
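
To make the synchronization option concrete, here is a minimal sketch in plain
Java. It is not Flink code: `SharedChannelStateFile`, `Segment`, `write` and
`read` are hypothetical names, and an in-memory `ByteArrayOutputStream` stands
in for the real checkpoint stream. Several subtasks append to one shared output
under a lock, and each gets back the offset and length of its own bytes so they
can be located again later.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch, not a Flink class: one output shared by many subtasks. */
class SharedChannelStateFile {

    /** Location of one subtask's bytes inside the shared output. */
    static final class Segment {
        final int offset;
        final int length;

        Segment(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Assumption: an in-memory buffer stands in for the real file stream.
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final Map<Integer, Segment> segments = new HashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    /** Appends one subtask's channel state; the lock keeps appends from interleaving. */
    Segment write(int subtaskIndex, byte[] channelState) {
        lock.lock();
        try {
            Segment segment = new Segment(out.size(), channelState.length);
            out.write(channelState, 0, channelState.length);
            segments.put(subtaskIndex, segment);
            return segment;
        } finally {
            lock.unlock();
        }
    }

    /** Reads back the bytes previously written by the given subtask. */
    byte[] read(int subtaskIndex) {
        lock.lock();
        try {
            Segment segment = segments.get(subtaskIndex);
            if (segment == null) {
                throw new IllegalArgumentException("No state for subtask " + subtaskIndex);
            }
            return Arrays.copyOfRange(
                    out.toByteArray(), segment.offset, segment.offset + segment.length);
        } finally {
            lock.unlock();
        }
    }

    /** Total number of bytes written to the shared output so far. */
    int size() {
        lock.lock();
        try {
            return out.size();
        } finally {
            lock.unlock();
        }
    }
}
```

The cost of this variant is that writers block each other while one of them
appends. The other option mentioned above, one `AsyncCheckpointRunnable` per
file, would presumably avoid the lock by routing all writes for a file through
a single writer.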
> Merge small ChannelState file for Unaligned Checkpoint
> ------------------------------------------------------
>
> Key: FLINK-26803
> URL: https://issues.apache.org/jira/browse/FLINK-26803
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / Network
> Reporter: fanrui
> Priority: Major
>
> When making an unaligned checkpoint, the number of ChannelState files is
> TaskNumber * subtaskNumber. For a high-parallelism job, this writes too many
> small files and causes high load on the HDFS NameNode.
>
> In our production, a job writes more than 50K small files for each Unaligned
> Checkpoint. Could we merge these files before writing them to the FileSystem?
> We could make the maximum number of files each TM can write in a single
> Unaligned Checkpoint configurable.