[ 
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679874#comment-17679874
 ] 

Cliff Resnick commented on FLINK-25672:
---------------------------------------

From what I can tell from the design, I imagine it may require a breaking 
change, given the stateless factory for the Enumerator. But I will be looking at it.

> FileSource enumerator remembers paths of all already processed files which 
> can result in large state
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25672
>                 URL: https://issues.apache.org/jira/browse/FLINK-25672
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Martijn Visser
>            Priority: Major
>
> As mentioned in the Filesystem documentation, for Unbounded File Sources, the 
> {{FileEnumerator}} currently remembers the paths of all already processed 
> files, and this state can in some cases grow rather large. 
> We should look into possibilities to reduce this. We could look into adding a 
> compressed form of tracking already processed files (for example, by keeping 
> a lower bound on modification timestamps).
> When fixed, this should also be reflected in the documentation, as mentioned 
> in https://github.com/apache/flink/pull/18288#discussion_r785707311
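
To make the "lower bound on modification timestamps" idea from the description 
concrete, here is a minimal sketch in plain Java. It is not Flink code: the 
class ModTimeTracker and its methods are hypothetical names made up for 
illustration, and nothing here reflects the actual enumerator API. It assumes 
that files are offered in ascending order of modification time within each 
discovery pass, and that no file ever appears later with a modification time 
older than the bound (such a file would be skipped).

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Flink's API. */
final class ModTimeTracker {
    // Highest modification time seen so far. Every file strictly older
    // than this bound is assumed to have been processed already.
    private long lowerBound = Long.MIN_VALUE;

    // Only the paths whose modification time equals the bound are kept,
    // so files sharing the newest timestamp can still be told apart.
    private final Set<String> pathsAtBound = new HashSet<>();

    /** Returns true if the file is new, and records it as processed. */
    boolean markIfNew(String path, long modTime) {
        if (modTime < lowerBound) {
            return false; // older than the bound: treated as processed
        }
        if (modTime > lowerBound) {
            // The bound advances, so paths kept for the old bound can go.
            lowerBound = modTime;
            pathsAtBound.clear();
        }
        // Set.add returns false if the path was already recorded.
        return pathsAtBound.add(path);
    }

    int trackedPathCount() {
        return pathsAtBound.size();
    }

    long lowerBound() {
        return lowerBound;
    }
}
```

With this approach the checkpointed state would be one timestamp plus the 
paths that share it, rather than every path ever seen. The cost is the 
ordering assumption above, which may not hold on every file system.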



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
