[
https://issues.apache.org/jira/browse/FLINK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arvid Heise reassigned FLINK-22792:
-----------------------------------
Assignee: Tianxin Zhao
> Limit size of already processed files in File Source SplitEnumerator
> --------------------------------------------------------------------
>
> Key: FLINK-22792
> URL: https://issues.apache.org/jira/browse/FLINK-22792
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Tianxin Zhao
> Assignee: Tianxin Zhao
> Priority: Major
>
> File Source uses {{ContinuousFileSplitEnumerator}} to discover files in the
> selected file system. A task inside the SplitEnumerator periodically lists
> the given path and creates splits from the files found there. To avoid
> reprocessing splits, all processed paths are currently recorded in the set
> {{pathsAlreadyProcessed}}. However, this set can grow indefinitely as new
> files are added to the input path, eventually causing an out-of-memory error.
> (Original PR: [https://github.com/apache/flink/pull/13401])
> This ticket aims to limit the size of {{pathsAlreadyProcessed}} by means of a
> configurable SLA: files older than (watermark - SLA) would not be processed
> and would also be removed from the {{pathsAlreadyProcessed}} set. The
> watermark is determined by the minimum modification time of the unprocessed
> files. The {{pathsAlreadyProcessed}} set would be cleaned up during every
> snapshot.
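
The proposal in the description can be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the change made in Flink:
the class ProcessedPathTracker, its methods, and the slaMillis parameter are
hypothetical names, and the watermark is assumed to be computed by the caller
as the minimum modification time of the unprocessed files.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink's API. */
public class ProcessedPathTracker {

    // Path -> modification time (ms) of every file already turned into a split.
    private final Map<String, Long> pathsAlreadyProcessed = new HashMap<>();

    // Assumed configurable SLA in milliseconds.
    private final long slaMillis;

    public ProcessedPathTracker(long slaMillis) {
        this.slaMillis = slaMillis;
    }

    /**
     * Returns true if the file should be processed, and records it.
     * Files older than (watermark - SLA) are ignored, as are files
     * that were already recorded.
     */
    public boolean tryAccept(String path, long modificationTime, long watermark) {
        if (modificationTime < watermark - slaMillis) {
            return false; // older than the cutoff: never processed
        }
        // putIfAbsent returns null only when the path was not yet recorded.
        return pathsAlreadyProcessed.putIfAbsent(path, modificationTime) == null;
    }

    /** Intended to run on every snapshot: drops entries older than the cutoff. */
    public void pruneOnSnapshot(long watermark) {
        long cutoff = watermark - slaMillis;
        pathsAlreadyProcessed.values().removeIf(modTime -> modTime < cutoff);
    }

    /** Number of paths currently remembered. */
    public int size() {
        return pathsAlreadyProcessed.size();
    }
}
```

Because discovery and cleanup use the same cutoff, a path that has been pruned
from the set is also too old to be accepted again, so pruning should not cause
reprocessing in this sketch.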
--
This message was sent by Atlassian Jira
(v8.3.4#803005)