Feifan Wang created FLINK-28172:
-----------------------------------
Summary: Scatter dstl files into separate directories by job id
Key: FLINK-28172
URL: https://issues.apache.org/jira/browse/FLINK-28172
Project: Flink
Issue Type: Improvement
Components: Runtime / State Backends
Affects Versions: 1.15.0
Reporter: Feifan Wang
In the current implementation of _FsStateChangelogStorage_, dstl files from
all jobs are put into the same directory (configured via
_dstl.dfs.base-path_). This works fine on an object store like S3, but
it causes problems on a file system like Hadoop (HDFS).
First, there may be an upper limit on the number of files in a single
directory. Raising this limit can significantly degrade the performance of the
distributed file system.
Second, dstl file management becomes difficult because the user cannot tell
which job a dstl file belongs to, especially when retained checkpoints are
enabled.
h3. Proposal
# create a subdirectory named after the job id under the _dstl.dfs.base-path_
directory when the job starts
# upload all dstl files of that job to the subdirectory
(Going a step further, we could even create two levels of subdirectories under
the _dstl.dfs.base-path_ directory, like _base-path/\{jobId}/dstl_. This way,
if the user configures _dstl.dfs.base-path_ to be the same as _state.checkpoints.dir_,
all files needed for job recovery will be in the same directory and well
organized.)
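
To make the two proposed layouts concrete, here is a minimal sketch of how the per-job paths could be derived. It is only an illustration under stated assumptions: the class _DstlPaths_ and its methods are hypothetical names, not part of Flink, and the sketch uses plain strings rather than Flink's own path and file system classes.

```java
import java.util.Objects;

/** Hypothetical helper (not a Flink class) that derives per-job dstl directories. */
final class DstlPaths {

    private DstlPaths() {}

    /** Single-level layout: base-path/{jobId}. */
    static String jobDir(String basePath, String jobId) {
        Objects.requireNonNull(basePath, "basePath");
        Objects.requireNonNull(jobId, "jobId");
        if (jobId.isEmpty() || jobId.contains("/")) {
            throw new IllegalArgumentException("invalid job id: " + jobId);
        }
        // Avoid a double slash when the configured base path already ends with one.
        return basePath.endsWith("/") ? basePath + jobId : basePath + "/" + jobId;
    }

    /** Two-level layout: base-path/{jobId}/dstl. */
    static String dstlDir(String basePath, String jobId) {
        return jobDir(basePath, jobId) + "/dstl";
    }

    public static void main(String[] args) {
        // Example values only; the base path would come from dstl.dfs.base-path.
        String basePath = "hdfs:///flink/checkpoints";
        String jobId = "a1b2c3";
        System.out.println(jobDir(basePath, jobId));
        System.out.println(dstlDir(basePath, jobId));
    }
}
```

With the two-level layout, pointing _dstl.dfs.base-path_ at the same location as _state.checkpoints.dir_ would place the dstl files in a _dstl_ subdirectory under the job's own directory, which is the arrangement the note above describes.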
--
This message was sent by Atlassian Jira
(v8.20.7#820007)