[ https://issues.apache.org/jira/browse/FLINK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fan Hong updated FLINK-31809:
-----------------------------
Description:
In the current implementation of {{ListStateWithCache}}, the
{{snapshotState}} function writes the full data to the file system every time,
even if the stored data has not changed since initialization. This can result
in high IO costs, especially when working with large data sets. Additionally,
this method is called in the same thread as operators, which can negatively
impact job efficiency.
Furthermore, when using local file systems, the full data is also written to
Flink state storage, which doubles the costs.
To address these issues, an incremental snapshot approach should be considered
to reduce IO and network costs.
was:
Current `ListStateWithCache#snapshotState` supports distributed file systems
and local file systems. However, in both cases, full data is written to the
filesystem (`dataCacheWriter.writeSegmentsToFiles()`) when `snapshotState` is
called.
Moreover, when local file system is used, full data is written to Flink state
storage right now, which doubles the costs.
> Improve efficiency of ListStateWithCache#snapshotState
> ------------------------------------------------------
>
> Key: FLINK-31809
> URL: https://issues.apache.org/jira/browse/FLINK-31809
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Fan Hong
> Priority: Major
>
> In the current implementation of {{ListStateWithCache}}, the
> {{snapshotState}} function writes the full data to the file system every
> time, even if the stored data has not changed since initialization. This can
> result in high IO costs, especially when working with large data sets.
> Additionally, this method is called in the same thread as operators, which
> can negatively impact job efficiency.
> Furthermore, when using local file systems, the full data is also written to
> Flink state storage, which doubles the costs.
> To address these issues, an incremental snapshot approach should be
> considered to reduce IO and network costs.
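The incremental approach suggested in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: the class, the `Segment` type, and every method name are hypothetical and are not part of the Flink ML {{ListStateWithCache}} API, and a byte counter stands in for the actual file write. The idea shown is that segments already persisted by an earlier snapshot are skipped, so an unchanged cache incurs no further IO.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of an incremental snapshot; not the Flink ML API. */
public class IncrementalSnapshotSketch {

    /** An immutable chunk of cached records with a hypothetical numeric id. */
    private static final class Segment {
        final int id;
        final byte[] data;

        Segment(int id, byte[] data) {
            this.id = id;
            this.data = data;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    // Ids of segments that an earlier snapshot has already persisted.
    private final Set<Integer> persistedIds = new HashSet<>();
    // Stand-in for real file-system writes: counts bytes that would be written.
    private long bytesWritten = 0;

    /** Appends a new segment to the cache. */
    public void add(byte[] data) {
        segments.add(new Segment(segments.size(), data));
    }

    /** Persists only segments not written before; returns how many were written. */
    public int snapshot() {
        int written = 0;
        for (Segment segment : segments) {
            // Set.add returns false when the id is already present, so
            // previously persisted segments are skipped.
            if (persistedIds.add(segment.id)) {
                bytesWritten += segment.data.length;
                written++;
            }
        }
        return written;
    }

    public long bytesWritten() {
        return bytesWritten;
    }
}
```

With this shape, a second `snapshot()` call on an unchanged cache writes nothing, and a call after new data arrives writes only the new segments. A real implementation would also need to handle segment deletion, checkpoint cleanup, and restoring the set of persisted segments after recovery, none of which is covered here.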