[
https://issues.apache.org/jira/browse/FLINK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tianxin Zhao updated FLINK-22792:
---------------------------------
Description:
File Source uses {{ContinuousFileSplitEnumerator}} to discover files in the
selected file system. A task inside the SplitEnumerator periodically lists the
given path and creates splits from it. To avoid reprocessing splits, all
processed paths are currently recorded in the set {{pathsAlreadyProcessed}}.
However, this set can grow indefinitely as new files are added to the input
path, eventually resulting in an out-of-memory error. (Original PR:
[https://github.com/apache/flink/pull/13401])
This ticket aims to limit the size of {{pathsAlreadyProcessed}} by means of a
configurable SLA: files older than (watermark - SLA) would be skipped and also
removed from the {{pathsAlreadyProcessed}} set. The watermark is determined by
the minimum modification time of unprocessed files. The
{{pathsAlreadyProcessed}} set would be cleaned up during every snapshot.
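The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: {{ProcessedPathTracker}}, {{discover}}, {{pruneOnSnapshot}} and {{trackedCount}} are hypothetical names, the SLA is assumed to be in milliseconds, and the watermark is assumed to only move forward.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a class in Flink.
final class ProcessedPathTracker {
    private final long slaMillis;
    // path -> modification time, kept so that old entries can be expired later
    private final Map<String, Long> pathsAlreadyProcessed = new HashMap<>();
    // Assumption: the watermark never moves backwards.
    private long watermark = Long.MIN_VALUE;

    ProcessedPathTracker(long slaMillis) {
        this.slaMillis = slaMillis;
    }

    // Files modified before (watermark - SLA) are ignored and forgotten.
    private long cutoff() {
        return watermark == Long.MIN_VALUE ? Long.MIN_VALUE : watermark - slaMillis;
    }

    // Takes one directory listing (path -> modification time) and returns
    // the paths that should be turned into new splits.
    List<String> discover(Map<String, Long> listedFiles) {
        long cutoff = cutoff();
        long minUnprocessed = Long.MAX_VALUE;
        List<String> newPaths = new ArrayList<>();
        for (Map.Entry<String, Long> file : listedFiles.entrySet()) {
            long modTime = file.getValue();
            if (modTime < cutoff || pathsAlreadyProcessed.containsKey(file.getKey())) {
                continue; // too old, or already handed out
            }
            newPaths.add(file.getKey());
            minUnprocessed = Math.min(minUnprocessed, modTime);
        }
        for (String path : newPaths) {
            pathsAlreadyProcessed.put(path, listedFiles.get(path));
        }
        if (!newPaths.isEmpty()) {
            // Watermark follows the minimum modification time of unprocessed files.
            watermark = Math.max(watermark, minUnprocessed);
        }
        return newPaths;
    }

    // Intended to be called on every snapshot: drop entries below the cutoff.
    void pruneOnSnapshot() {
        long cutoff = cutoff();
        pathsAlreadyProcessed.values().removeIf(modTime -> modTime < cutoff);
    }

    int trackedCount() {
        return pathsAlreadyProcessed.size();
    }
}
```

Because {{discover}} rejects anything older than the cutoff before consulting the set, a path that was pruned from the set is still not handed out again if it reappears in a later listing. That is what makes it safe to bound the set in this sketch.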
Deduplication
was:
File Source uses {{ContinuousFileSplitEnumerator}} to discover files in the
selected file system. A task inside the SplitEnumerator periodically lists the
given path and creates splits from it. To avoid reprocessing splits, all
processed paths are currently recorded in the set {{pathsAlreadyProcessed}}.
However, this set can grow indefinitely as new files are added to the input
path, eventually resulting in an out-of-memory error. (Original PR:
[https://github.com/apache/flink/pull/13401])
This ticket aims to limit the size of {{pathsAlreadyProcessed}} by means of a
configurable SLA: files older than (watermark - SLA) would be skipped and also
removed from the {{pathsAlreadyProcessed}} set. The watermark is determined by
the minimum modification time of unprocessed files. The
{{pathsAlreadyProcessed}} set would be cleaned up during every snapshot.
> Limit size of already processed files in File Source SplitEnumerator
> --------------------------------------------------------------------
>
> Key: FLINK-22792
> URL: https://issues.apache.org/jira/browse/FLINK-22792
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Tianxin Zhao
> Assignee: Tianxin Zhao
> Priority: Major