[
https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201223#comment-17201223
]
Guowei Ma commented on FLINK-9940:
----------------------------------
Thanks [~Averell] and [~kkl0u] for your comments. Sorry for the late reply.
After thinking it over again, I think it is a more realistic compromise. I have
left one small concern on the pull request; we can discuss it there.
> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
> Key: FLINK-9940
> URL: https://issues.apache.org/jira/browse/FLINK-9940
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream
> Affects Versions: 1.5.1
> Environment: Flink 1.5, EMRFS
> Reporter: Huyen Levan
> Assignee: Huyen Levan
> Priority: Major
> Labels: EMRFS, Flink, S3, pull-request-available
>
> When using StreamExecutionEnvironment.readFile() with
> FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, if
> a large number of files are added or modified at the same time, the directory
> monitoring process might miss some files. The number of missed files depends
> on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification
> time of the file that was added (or modified) last. So when multiple files
> share the same last-modified timestamp, some of them can be missed.
> Suggested solution (thanks to [Fabian Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
> a hybrid approach that keeps the names of all files whose modification
> timestamp is larger than the maximum modification time minus an offset.
> _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_
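For illustration, the hybrid approach suggested in the description could be sketched roughly as follows. This is only a minimal sketch and not the code from the pull request or from ContinuousFileMonitoringFunction: the class name RecentFileTracker, the method shouldProcess, and the offsetMillis parameter are all hypothetical names introduced here, and checkpointing of the tracked state is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: remembers the names of recently
// modified files so that files sharing a timestamp are not skipped.
class RecentFileTracker {
    private final long offsetMillis;          // assumed size of the look-back window
    private boolean initialized = false;
    private long maxModTime = 0L;             // largest modification time seen so far
    private final Map<String, Long> recentFiles = new HashMap<>();

    RecentFileTracker(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    // Returns true if the file should be forwarded for reading.
    boolean shouldProcess(String path, long modTime) {
        // Files at or below (max mod time - offset) are treated as already handled.
        if (initialized && modTime <= maxModTime - offsetMillis) {
            return false;
        }
        // Inside the window, use the file name to detect files already seen.
        Long seen = recentFiles.get(path);
        if (seen != null && seen >= modTime) {
            return false;
        }
        recentFiles.put(path, modTime);
        if (!initialized || modTime > maxModTime) {
            initialized = true;
            maxModTime = modTime;
            // Drop names that have fallen out of the window to bound the state size.
            final long threshold = maxModTime - offsetMillis;
            recentFiles.values().removeIf(t -> t <= threshold);
        }
        return true;
    }
}
```

With an offset of 1000 ms, two files that both carry timestamp 5000 are each accepted once, while a file with timestamp 4000 arriving afterwards is ignored. The offset therefore trades state size against tolerance for files that become visible late.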
--
This message was sent by Atlassian Jira
(v8.3.4#803005)