[ 
https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649144#comment-16649144
 ] 

Huyen Levan commented on FLINK-9940:
------------------------------------

[~juanmirocks] Thanks for the link. I was not aware of that Jira ticket when I 
raised this one. The two issues are similar, and the underlying problem affects 
all file systems, because several files can share the same timestamp.

However, with S3 the problem is more severe: there are cases where newly 
created files (files that first appear in a directory scan after the previous 
scan) have a modification timestamp smaller than the globalModificationTime 
recorded by that previous scan.
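
To make the two cases concrete, here is a minimal sketch of the hybrid approach 
suggested in the issue description. It is not the code in 
ContinuousFileMonitoringFunction: the class HybridFileTracker, its method 
shouldProcess and the offsetMillis parameter are hypothetical names used only 
for illustration. The idea is to remember the paths of all files whose 
modification time lies within an offset of the largest modification time seen 
so far, so that files with equal timestamps, or late-listed files with slightly 
older timestamps, are still picked up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink. */
class HybridFileTracker {
    private final long offsetMillis;            // assumed tolerance window
    private long maxModTime = Long.MIN_VALUE;   // largest mod time seen so far
    // Paths (and their mod times) that fall inside the tolerance window.
    private final Map<String, Long> recentFiles = new HashMap<>();

    HybridFileTracker(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    /** Returns true if the file is new or modified and should be read. */
    boolean shouldProcess(String path, long modTime) {
        // Files at or below (maxModTime - offset) are treated as already read.
        if (maxModTime != Long.MIN_VALUE && modTime <= maxModTime - offsetMillis) {
            return false;
        }
        // Inside the window: decide by path, not by timestamp alone.
        Long seen = recentFiles.get(path);
        if (seen != null && seen >= modTime) {
            return false;
        }
        recentFiles.put(path, modTime);
        if (modTime > maxModTime) {
            maxModTime = modTime;
            // Drop entries that have fallen out of the window.
            long threshold = maxModTime - offsetMillis;
            recentFiles.values().removeIf(t -> t <= threshold);
        }
        return true;
    }

    public static void main(String[] args) {
        HybridFileTracker tracker = new HybridFileTracker(1000L);
        System.out.println(tracker.shouldProcess("s3://bucket/a", 5000L)); // first file
        System.out.println(tracker.shouldProcess("s3://bucket/b", 5000L)); // same timestamp
        System.out.println(tracker.shouldProcess("s3://bucket/c", 4500L)); // listed late, older timestamp
        System.out.println(tracker.shouldProcess("s3://bucket/a", 5000L)); // already read
    }
}
```

The trade-off in this sketch is the size of the offset: a file that shows up 
with a timestamp older than the window is still missed, while a larger window 
means more paths have to be kept in state.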





> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
>                 Key: FLINK-9940
>                 URL: https://issues.apache.org/jira/browse/FLINK-9940
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.5.1
>         Environment: Flink 1.5, EMRFS
>            Reporter: Huyen Levan
>            Assignee: Huyen Levan
>            Priority: Major
>              Labels: EMRFS, Flink, S3, pull-request-available
>             Fix For: 1.7.0
>
>
> When using StreamExecutionEnvironment.readFile() with 
> FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, if 
> there is a high amount of new/modified files at the same time, the directory 
> monitoring process might miss some files. The number of missing files depends 
> on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification 
> time of the file that was added (or modified) last. So when multiple files 
> have the same last-modified timestamp, some of them can be missed.
> Suggested solution (thanks to [Fabian 
> Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
>  a hybrid approach that keeps the names of all files that have a mod 
> timestamp that is larger than the max mod time minus an offset. 
> _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
