[
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679868#comment-17679868
]
Cliff Resnick commented on FLINK-25672:
---------------------------------------
This issue has turned into a real problem for us with our transactional
datastream jobs. The problem is exacerbated by the fact that the state is
not distributed but localized to the job manager, which is rather ugly
in our HA K8s setup, where we have a 16 GB limit in the common pool that our
JMs run in, and we are blowing past that simply with the un-evictable file path
history.
Is anyone looking into this?
> FileSource enumerator remembers paths of all already processed files which
> can result in large state
> ----------------------------------------------------------------------------------------------------
>
> Key: FLINK-25672
> URL: https://issues.apache.org/jira/browse/FLINK-25672
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Martijn Visser
> Priority: Major
>
> As mentioned in the Filesystem documentation, for Unbounded File Sources, the
> {{FileEnumerator}} currently remembers paths of all already processed files,
> which is a state that can in some cases grow rather large.
> We should look into possibilities to reduce this. We could look into adding a
> compressed form of tracking already processed files (for example by keeping
> a lower bound on modification timestamps).
> When fixed, this should also be reflected in the documentation, as mentioned
> in https://github.com/apache/flink/pull/18288#discussion_r785707311
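
To make the "modification timestamp lower bound" idea from the description
concrete, here is a minimal sketch. It is not Flink code: the class
WatermarkFileTracker and its methods are hypothetical names invented for
illustration, and it assumes that new files always appear with a modification
time at or above the newest one already seen. Under that assumption the
enumerator only needs to keep one timestamp plus the paths that sit exactly on
that timestamp, instead of every path ever processed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not part of Flink's API. */
final class WatermarkFileTracker {
    // Highest modification time among files already handed out.
    private long watermark = Long.MIN_VALUE;
    // Only the paths whose modification time equals the watermark are kept.
    private final Set<String> pathsAtWatermark = new HashSet<>();

    /**
     * Takes one directory listing (path -> modification time) and returns
     * the paths not yet processed, then advances the lower bound.
     */
    List<String> selectNew(Map<String, Long> discovered) {
        List<String> fresh = new ArrayList<>();
        long newWatermark = watermark;
        for (Map.Entry<String, Long> e : discovered.entrySet()) {
            long mtime = e.getValue();
            boolean isNew = mtime > watermark
                    || (mtime == watermark && !pathsAtWatermark.contains(e.getKey()));
            if (isNew) {
                fresh.add(e.getKey());
                newWatermark = Math.max(newWatermark, mtime);
            }
        }
        if (newWatermark > watermark) {
            // Everything below the new bound is implied by the bound itself.
            watermark = newWatermark;
            pathsAtWatermark.clear();
        }
        for (Map.Entry<String, Long> e : discovered.entrySet()) {
            if (e.getValue() == watermark) {
                pathsAtWatermark.add(e.getKey());
            }
        }
        return fresh;
    }

    long watermark() {
        return watermark;
    }

    int retainedPathCount() {
        return pathsAtWatermark.size();
    }
}
```

The trade-off is that a file which shows up late with a modification time
below the current bound is silently skipped, so a real fix would probably need
some configurable slack or another way to handle out-of-order arrivals.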