[ https://issues.apache.org/jira/browse/NIFI-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025153#comment-16025153 ]
Bryan Bende commented on NIFI-3979:
-----------------------------------
After thinking about this some more, the PR I submitted is not correct. The
challenges here stem from two constraints: for performance reasons we can't
keep track of every filename we have listed, and we can't reliably compare the
system time where NiFi is running to the timestamps of the listings, since
those come from an external file system.

The current behavior is that during an execution of the processor we purposely
leave out the entries with the latest timestamp, and then include them in the
next listing. This was done because we don't know whether more entries with
that timestamp are still coming in. If we include the latest ones now, then on
the next iteration we would either skip over the additional ones or have to
list the latest ones again as duplicates. So our current implementation favors
no duplicates and no missed data, with the limitation that the latest entries
lag behind by one execution.
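
To make the hold-back behavior concrete, here is a minimal Python sketch of it.
This is a simplified model of what is described above, not the actual ListHDFS
code: the names Entry, ListingState and run_listing are hypothetical, and the
rule "emit the held-back entries once the latest timestamp has not advanced
since the previous run" is an assumption about how they get included next time.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    path: str
    mtime: int  # modification time as reported by the remote file system


class ListingState(NamedTuple):
    emitted_through: int       # every entry with mtime <= this was emitted
    held_back: Optional[int]   # latest timestamp seen last run but not emitted


def run_listing(entries: List[Entry],
                state: ListingState) -> Tuple[List[Entry], ListingState]:
    """One processor execution: returns (entries to emit, new state)."""
    new = [e for e in entries if e.mtime > state.emitted_through]
    if not new:
        return [], state

    latest = max(e.mtime for e in new)
    if latest == state.held_back:
        # Assumption: the latest timestamp did not advance since the last
        # run, so those entries are considered settled and all get emitted.
        return new, ListingState(latest, None)

    # Otherwise hold back everything carrying the latest timestamp, because
    # more entries with that timestamp may still be arriving.
    to_emit = [e for e in new if e.mtime < latest]
    emitted_through = max((e.mtime for e in to_emit),
                          default=state.emitted_through)
    return to_emit, ListingState(emitted_through, latest)
```

Starting from ListingState(-1, None), a first run over files with timestamps
100 and 200 emits only the 100 entry. A second run emits everything at 200,
including a file that arrived in between with that same timestamp. Nothing is
duplicated or missed, but the newest files always wait one execution.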

A potential solution, although somewhat complex to implement, might be to keep
a count of the number of entries with the latest timestamp. Say the processor
runs and there are 10 files with the latest timestamp: we include all of them
in the current listing and set a variable to 10. If on the next execution we
find there are now 11 files with that previous timestamp, we list all 11
again, since we don't know which ones were already listed. This leads to
duplicate listings in the edge case where files are written with the same
timestamp on each side of an execution, but in the common case it would allow
us to always list the latest files.
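
The count-based idea could look roughly like the Python sketch below. It only
illustrates the proposal and is not an implementation from the PR: the names
Entry, CountState and run_listing_with_count are hypothetical, and it assumes
files are never deleted between executions (a deletion would make the count
comparison unreliable).

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    path: str
    mtime: int  # modification time as reported by the remote file system


class CountState(NamedTuple):
    latest_mtime: int   # latest timestamp included in a previous listing
    latest_count: int   # how many entries carried that timestamp back then


def run_listing_with_count(entries: List[Entry],
                           state: CountState) -> Tuple[List[Entry], CountState]:
    """One processor execution: returns (entries to emit, new state)."""
    same = [e for e in entries if e.mtime == state.latest_mtime]
    newer = [e for e in entries if e.mtime > state.latest_mtime]

    to_emit: List[Entry] = []
    if len(same) > state.latest_count:
        # More entries share the previous latest timestamp than before. We
        # cannot tell which ones are new, so list all of them again.
        to_emit.extend(same)
    to_emit.extend(newer)

    if newer:
        latest = max(e.mtime for e in newer)
        count = sum(1 for e in newer if e.mtime == latest)
    else:
        latest, count = state.latest_mtime, len(same)
    return to_emit, CountState(latest, count)
```

With 10 files at the latest timestamp, the first run emits all 10 and stores a
count of 10. If an 11th file with that timestamp shows up before the next run,
all 11 are emitted (10 of them as duplicates); if nothing changes, nothing is
emitted.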

We could also change nothing and just document the behavior of this processor,
noting that it is expected to be scheduled fairly frequently, on the order of
seconds or a few minutes rather than hours.

> ListHDFS always skips files with latest timestamp
> -------------------------------------------------
>
> Key: NIFI-3979
> URL: https://issues.apache.org/jira/browse/NIFI-3979
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.1.0, 1.2.0, 1.1.1
> Reporter: Bryan Bende
> Assignee: Bryan Bende
> Priority: Minor
> Fix For: 1.3.0
>
>
> In NIFI-3213 there was a fix made for ListFile to correct a problem where it
> was never listing the latest file.
> The same problem exists in ListHDFS.