[
https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166113#comment-17166113
]
Guowei Ma commented on FLINK-9940:
----------------------------------
I read the PR. As I understand it, the PR stores a set of file names so that
out-of-order files are not missed. I think there are two problems:
# The operator's memory might blow up (thanks to [~kkl0u]).
# This method cannot resolve all the problems, because memory is always limited.
A rough idea would be to materialize a directory meta that describes all the
files we have dealt with as of the current checkpoint. We would then
periodically check whether there are any missing files that we still need to
deal with. (I assume that the external system does not delete files, but it
might add a new file with an existing name.) What do you think?
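
To make the idea a bit more concrete, here is a minimal plain-Java sketch of one way the directory meta could work. The class and method names (DirectoryMeta, markHandled, findMissing, snapshot) are hypothetical and not part of Flink, and the sketch assumes a directory listing is available as a map from path to modification time. It does not show how the meta would be stored in a checkpoint.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Flink API: a record of the files handled so far. */
class DirectoryMeta {
    // path -> modification time of the file version that was already handled
    private final Map<String, Long> handled = new HashMap<>();

    /** Records that the given version of a file has been handled. */
    void markHandled(String path, long modificationTime) {
        handled.put(path, modificationTime);
    }

    /**
     * Compares a fresh directory listing (path -> modification time) with the
     * recorded meta. Returns the paths that are new, or that were replaced by
     * a file of the same name with a different modification time.
     */
    List<String> findMissing(Map<String, Long> listing) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            Long seen = handled.get(entry.getKey());
            if (seen == null || !seen.equals(entry.getValue())) {
                missing.add(entry.getKey());
            }
        }
        Collections.sort(missing); // deterministic order for the caller
        return missing;
    }

    /** Copy of the meta, as it might be written out at a checkpoint. */
    Map<String, Long> snapshot() {
        return new HashMap<>(handled);
    }
}
```

A periodic check would call findMissing with the current listing and hand the returned paths to the readers. The meta still grows with the number of files, so this sketch only makes sense if it is materialized outside the operator's heap.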
> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
> Key: FLINK-9940
> URL: https://issues.apache.org/jira/browse/FLINK-9940
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream
> Affects Versions: 1.5.1
> Environment: Flink 1.5, EMRFS
> Reporter: Huyen Levan
> Assignee: Huyen Levan
> Priority: Major
> Labels: EMRFS, Flink, S3, pull-request-available
>
> When using StreamExecutionEnvironment.readFile() with
> FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, if
> there are many new/modified files at the same time, the directory
> monitoring process might miss some files. The number of missing files depends
> on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification
> time of the file that was added (or modified) last. So when there are
> multiple files having the same last-modified timestamp, some of them can be
> missed.
> Suggested solution (thanks to [Fabian
> Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
> a hybrid approach that keeps the names of all files that have a mod
> timestamp that is larger than the max mod time minus an offset.
> _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_
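
The hybrid approach quoted above could look roughly like the following plain-Java sketch. It is not the code from the pull request and not Flink API: RecentFilesTracker, shouldProcess and offsetMillis are hypothetical names, and keeping the modification time next to each file name (so that a modified file is read again) is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Flink API: max mod time plus a window of recent files. */
class RecentFilesTracker {
    private final long offsetMillis;
    private long maxModTime = Long.MIN_VALUE;
    // Files whose modification time is larger than (maxModTime - offsetMillis).
    private final Map<String, Long> recent = new HashMap<>();

    RecentFilesTracker(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    // Files at or below this modification time are ignored.
    private long threshold() {
        // Avoid overflow before the first file has been seen.
        return maxModTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxModTime - offsetMillis;
    }

    /** Returns true if the file should be forwarded for reading. */
    boolean shouldProcess(String path, long modTime) {
        if (modTime <= threshold()) {
            return false; // older than the tracked window
        }
        Long seen = recent.get(path);
        if (seen != null && seen >= modTime) {
            return false; // this version was already forwarded
        }
        recent.put(path, modTime);
        if (modTime > maxModTime) {
            maxModTime = modTime;
            long limit = threshold();
            // Drop entries that fell out of the window, which bounds the state.
            recent.values().removeIf(time -> time <= limit);
        }
        return true;
    }

    int trackedFiles() {
        return recent.size();
    }
}
```

With an offset of 5, two files with modification time 100 are both accepted, and a file that shows up later with modification time 97 is still accepted, while one with modification time 95 is ignored. The offset trades memory for tolerance: files that appear later than the offset allows are still missed.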
--
This message was sent by Atlassian Jira
(v8.3.4#803005)