[
https://issues.apache.org/jira/browse/FLINK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249322#comment-16249322
]
Kostas Kloudas commented on FLINK-8046:
---------------------------------------
Hi [~jmcejuela]! Thanks a lot for reporting this and working on it.
As I commented in the mailing list thread you opened, I do not think that the
solution is to remove the {{=}} from the
{{modificationTime <= globalModificationTime;}} check in the
{{ContinuousFileMonitoringFunction}}, as this would lead to duplicates: files
already processed at that timestamp would be picked up again.
The solution, in my opinion, is to keep a list of the filenames (or hashes) of
the files processed for the last {{globalModTimestamp}} (and only for that
timestamp). When new files show up with the same timestamp, check whether
their names are already in that list, and ignore only those that are.
This way you pay a bit of memory but you get what you want.
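To make the idea concrete, here is a minimal sketch. It is not the actual Flink code: the class {{SameTimestampFilter}} and its field names are hypothetical, and it assumes files are offered in ascending modification-time order.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper illustrating the proposal; not part of Flink. */
class SameTimestampFilter {
    // Largest modification time seen so far.
    private long globalModificationTime = Long.MIN_VALUE;
    // Paths already processed at exactly globalModificationTime.
    private final Set<String> seenAtGlobalModTime = new HashSet<>();

    /** Returns true if the file should be skipped. */
    boolean shouldIgnore(String path, long modificationTime) {
        if (modificationTime < globalModificationTime) {
            return true; // strictly older: already covered
        }
        if (modificationTime == globalModificationTime) {
            // Set.add returns false when the path is already present,
            // i.e. this file was processed before at this timestamp.
            return !seenAtGlobalModTime.add(path);
        }
        // Strictly newer: advance the timestamp and reset the set,
        // so only names for the latest timestamp are kept in memory.
        globalModificationTime = modificationTime;
        seenAtGlobalModTime.clear();
        seenAtGlobalModTime.add(path);
        return false;
    }
}
```

With this, two different files carrying the same timestamp are both accepted, while a file seen a second time at that timestamp is ignored. In a real source the set would presumably have to be stored wherever the timestamp itself is stored.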
What do you think?
> ContinuousFileMonitoringFunction wrongly ignores files with exact same
> timestamp
> --------------------------------------------------------------------------------
>
> Key: FLINK-8046
> URL: https://issues.apache.org/jira/browse/FLINK-8046
> Project: Flink
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.3.2
> Reporter: Juan Miguel Cejuela
> Labels: stream
> Fix For: 1.5.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The current monitoring of files sets the internal variable
> `globalModificationTime` to filter out files that are "older". However, the
> current test (to check "older") does
> `boolean shouldIgnore = modificationTime <= globalModificationTime;` (from
> `shouldIgnore`)
> The comparison should strictly be SMALLER (NOT smaller or equal). The method
> documentation also states "This happens if the modification time of the file
> is _smaller_ than...".
> Treating equal timestamps as "older" causes some files with the exact same
> timestamp to be ignored. The behavior is also non-deterministic, as the first
> file to be accepted ("first" being pretty much random) causes the remaining
> files with the exact same timestamp to be ignored.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)