[
https://issues.apache.org/jira/browse/FLINK-27155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524857#comment-17524857
]
Roman Khachatryan commented on FLINK-27155:
-------------------------------------------
I think we should consider widening the scope of the cache, because a single
file may be used by multiple tasks.
However, with such caching we'll have to deal with concurrent file accesses. If
we assume that IO is the bottleneck, then the easiest and most efficient way to
address it is to serialize all these accesses.
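To illustrate, here is a minimal sketch of serializing all reads of the same file through a per-file single-threaded executor, assuming IO is the bottleneck; the class and method names are made up, not existing Flink API:
{code:java}
// Hypothetical sketch: serialize concurrent reads of the same changelog file
// by funneling them through a single-threaded executor per file path.
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class SerializedFileAccess implements AutoCloseable {

    // one executor per file path => all reads of that file run sequentially
    private final Map<String, ExecutorService> perFileExecutors = new ConcurrentHashMap<>();

    public <T> CompletableFuture<T> read(String filePath, Supplier<T> readOperation) {
        ExecutorService executor =
                perFileExecutors.computeIfAbsent(filePath, p -> Executors.newSingleThreadExecutor());
        return CompletableFuture.supplyAsync(readOperation, executor);
    }

    @Override
    public void close() {
        perFileExecutors.values().forEach(ExecutorService::shutdownNow);
    }
}
{code}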
Another problem is reading the file out-of-order (leading to random IO). It can
be solved by collecting the offsets from all tasks and then passing them to the
reader (along with file names and callbacks). This can be done separately (as
the next step) once the same file is read from one place.
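A minimal sketch of that offset-collecting step, assuming all tasks register their (offset, length, callback) triples before the file is read and that the ranges don't overlap; ReadRequest and OrderedChangelogReader are illustrative names, not existing Flink classes:
{code:java}
// Hypothetical sketch: collect (offset, length, callback) requests from all tasks
// for one changelog file, sort them by offset, and read the file sequentially.
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

public class OrderedChangelogReader {

    private static final class ReadRequest {
        final long offset;
        final int length;
        final Consumer<byte[]> callback;

        ReadRequest(long offset, int length, Consumer<byte[]> callback) {
            this.offset = offset;
            this.length = length;
            this.callback = callback;
        }
    }

    private final List<ReadRequest> requests = new ArrayList<>();

    /** Each task registers the ranges it needs before the file is read. */
    public synchronized void register(long offset, int length, Consumer<byte[]> callback) {
        requests.add(new ReadRequest(offset, length, callback));
    }

    /** Read the file once, serving the registered ranges in ascending offset order. */
    public synchronized void readAll(String filePath) throws IOException {
        // assumes non-overlapping ranges, so the stream only ever moves forward
        requests.sort(Comparator.comparingLong((ReadRequest r) -> r.offset));
        try (DataInputStream in = new DataInputStream(new FileInputStream(filePath))) {
            long position = 0;
            for (ReadRequest request : requests) {
                in.skipNBytes(request.offset - position); // forward-only skip, no random IO
                byte[] buffer = new byte[request.length];
                in.readFully(buffer);
                request.callback.accept(buffer);
                position = request.offset + request.length;
            }
        }
    }
}
{code}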
Code-wise, StateChangelogStorage has exactly this scope: it is created on the TM
per job, and then used by all tasks of that job.
Similar to how multiple FsStateChangelogWriter share the same
StateChangeUploadScheduler, readers can share some cache manager.
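For example, such a cache manager could be created once per job on the TM and handed to every reader; ChangelogCacheManager below is a hypothetical sketch, not an existing Flink class, and the actual DFS download is injected as a function:
{code:java}
// Hypothetical sketch: a cache manager shared by all readers of a job on one TM,
// so each remote changelog file is downloaded at most once.
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ChangelogCacheManager {

    // remote changelog path -> local copy downloaded at most once per TaskManager
    private final Map<String, File> localCopies = new ConcurrentHashMap<>();

    // actual DFS download is injected; it is outside the scope of this sketch
    private final Function<String, File> downloader;

    public ChangelogCacheManager(Function<String, File> downloader) {
        this.downloader = downloader;
    }

    /** All readers of the job call this; only the first call fetches the remote file. */
    public File getOrDownload(String remotePath) {
        return localCopies.computeIfAbsent(remotePath, downloader);
    }
}
{code}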
To clean up the cache, I think it's better to use something like reference
counting (see the sketch below):
* increment on first use of the file, or simply on changelogStorage.createReader()
* decrement on reader.close(), i.e. at the end of backend recovery
Using (only) timeouts can be less efficient (files are kept unnecessarily long,
which matters because of the space amplification of the changelog) or unstable
with small timeouts.
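A hypothetical sketch of such reference-counted cleanup (class and method names are made up; acquire would be driven by changelogStorage.createReader(), release by reader.close()):
{code:java}
// Hypothetical sketch of reference-counted cache cleanup: acquire() on
// createReader() / first use of a file, release() on reader.close();
// the local copy is deleted when the count drops to zero.
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedFileCache {

    private static final class Entry {
        final File localFile;
        final AtomicInteger refCount = new AtomicInteger();

        Entry(File localFile) {
            this.localFile = localFile;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Increment: called when a reader starts using the (cached) file. */
    public File acquire(String remotePath, File localFile) {
        Entry entry = entries.computeIfAbsent(remotePath, p -> new Entry(localFile));
        entry.refCount.incrementAndGet();
        return entry.localFile;
    }

    /** Decrement: called on reader.close(), i.e. at the end of backend recovery. */
    public void release(String remotePath) {
        Entry entry = entries.get(remotePath);
        if (entry != null && entry.refCount.decrementAndGet() == 0) {
            entries.remove(remotePath);
            entry.localFile.delete(); // no readers left: drop the local copy
        }
    }
}
{code}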
WDYT?
> Reduce multiple reads to the same Changelog file in the same taskmanager
> during restore
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-27155
> URL: https://issues.apache.org/jira/browse/FLINK-27155
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / State Backends
> Reporter: Feifan Wang
> Assignee: Feifan Wang
> Priority: Major
> Fix For: 1.16.0
>
>
> h3. Background
> In the current implementation, state changes of different operators in the
> same taskmanager may be written to the same changelog file, which effectively
> reduces the number of files and requests to DFS.
> On the other hand, the current implementation also reads the same
> changelog file multiple times on recovery. More specifically, the number of
> times the same changelog file is accessed is related to the number of
> ChangeSets contained in it. And since each read has to skip the preceding
> bytes, that network traffic is also wasted.
> The result is a lot of unnecessary requests to DFS when there are multiple
> slots and keyed state in the same taskmanager.
> h3. Proposal
> We can reduce multiple reads to the same changelog file in the same
> taskmanager during restore.
> One possible approach is to read the changelog file all at once on first
> access and cache it in memory or in a local file for a period of time.
> I think this could be a subtask of [v2 FLIP-158: Generalized incremental
> checkpoints|https://issues.apache.org/jira/browse/FLINK-25842] .
> Hi [~ym], [~roman], what do you think about it?