Tianxin Zhao created FLINK-22792:
------------------------------------

             Summary: Limit size of already processed files in File Source 
SplitEnumerator
                 Key: FLINK-22792
                 URL: https://issues.apache.org/jira/browse/FLINK-22792
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
            Reporter: Tianxin Zhao


File Source makes use of {{ContinuousFileSplitEnumerator}} to discover files in 
selected file system. Task inside the SplitEnumerator periodically lists given 
path and creates splits from the path. To avoid splits getting reprocessed, 
currently all processed paths is recorded in the set {{pathsAlreadyProcessed}}. 
However, this set could grow indefinitely with new files added to the input 
path and eventually result in out of memory issue. (Original PR: 
[https://github.com/apache/flink/pull/13401])

This ticket aim to limit the size of {{pathsAlreadyProcessed}} in use of a 
configurable SLA such that files older than some (watermark - SLA) would be 
ignored to be processed and also cleaned up from the {{pathsAlreadyProcessed}} 
set. Watermark is decided based on the minimum modification time of unprocessed 
files. {{pathsAlreadyProcessed}} set would be cleaned up during every snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to