[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535959#comment-16535959
 ] 

Kon Soulianidis commented on NIFI-3332:
---------------------------------------

Echoing [~doaks80]'s concerns.  We've been using the ListFile processor to 
watch an SMB-mounted share.  

Every 1-2 weeks, the processor just misses files.

I have checked our company's security logs, and the service account that 
NiFi runs under has not even attempted to open those files.  However, it has 
pulled files received in that directory seconds earlier.

Our volumes aren't very high or frequent: usually 500 files (<100k) sent in 
small bursts over the course of 6 hours.

To work around this issue, I created a processor that, at the end of the day, 
issues a `touch *` on the directory that the ListFile processor is watching.  
This updates the timestamp of any files left behind and triggers NiFi to 
reprocess them.
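
For anyone who wants the same workaround outside of NiFi, here is a minimal 
sketch of what the `touch *` step does.  It is not the processor I built; the 
function name `touch_all` and its `now` parameter are made up for 
illustration.  Like the shell glob, it skips dotfiles and does not recurse.

```python
import os
import time
from pathlib import Path


def touch_all(directory, now=None):
    """Bump the access/modification time of every regular, non-hidden file
    directly inside `directory` and return the names that were touched.

    Hypothetical helper: it mirrors `touch *`, which skips dotfiles and does
    not descend into subdirectories.
    """
    now = time.time() if now is None else now
    touched = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and not entry.name.startswith("."):
            os.utime(entry, (now, now))  # set atime and mtime to `now`
            touched.append(entry.name)
    return touched
```

Run once a day against the watched directory, this gives leftover files a 
timestamp newer than the one held in the processor state, so the next listing 
should pick them up.  Note that it also touches files that were already 
listed and are still in the directory.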

Obviously, the big selling point of any solution that handles Enterprise 
Integration Pattern flows is the ability to handle the basics of 
synchronization.  This has left a bad impression of NiFi within our 
organisation, which is a shame given the otherwise clean UI for building 
flows and the auditability that provenance brings.

Our NiFi is a standalone Docker instance running 1.6.0.

> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>             Fix For: 1.4.0
>
>         Attachments: Test-showing-ListFile-timestamp-bug.log, 
> Test-showing-ListFile-timestamp-bug.patch, listfiles.png
>
>
> The new state implementation for the ListXXX processors based on 
> AbstractListProcessor creates a race condition when processor runs occur 
> while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a 
> given timestamp.  Without the record of files processed, the remainder of the 
> batch is ignored on the next processor run since their timestamp is not 
> greater than the one timestamp stored in processor state.  With the file 
> tracking it was possible to process files that matched the timestamp exactly 
> and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx 
> is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp 
> in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating 2nd half Tx 
> timestamp batch.
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit 
> test that demonstrates the problem, and a log[2] of the output from one such 
> run.  The test writes 3 files each in two batches, with processor runs after 
> each batch.  Batch 2 writes files with timestamps older than, equal to, and 
> newer than the timestamp stored when batch 1 was processed, but only the newer 
> file is picked up.  The older file is correctly ignored, but the file with the 
> matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log
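
The T0-T4 timeline in the quoted description can be reproduced with a toy 
model.  The sketch below is not NiFi's AbstractListProcessor code; the 
function names, the dict-based state, and the file names are all assumptions 
made for illustration.  It contrasts a lister that stores only the latest 
timestamp with one that also remembers which files it has already emitted at 
that timestamp.

```python
def list_timestamp_only(files, state):
    """Emit files strictly newer than the stored timestamp (the behaviour
    described in the issue).  `files` maps name -> timestamp."""
    latest = state.get("latest", -1)
    new = sorted(n for n, ts in files.items() if ts > latest)
    if new:
        state["latest"] = max(files[n] for n in new)
    return new


def list_with_tracking(files, state):
    """Also emit files whose timestamp equals the stored one, unless they
    were already emitted at that timestamp."""
    latest = state.get("latest", -1)
    seen = state.get("seen_at_latest", set())
    new = sorted(
        n for n, ts in files.items()
        if ts > latest or (ts == latest and n not in seen)
    )
    if new:
        newest = max(files[n] for n in new)
        if newest > latest:
            seen = set()  # older bookkeeping is no longer needed
        state["latest"] = newest
        state["seen_at_latest"] = seen | {n for n in new if files[n] == newest}
    return new


def simulate(lister):
    """Replay the timeline: half a batch, a run, the other half, a run."""
    tx = 100                                    # T0: batch timestamp Tx
    state = {}
    source = {"a.txt": tx, "b.txt": tx}         # T1: first half written
    first = lister(source, state)               # T2: first run
    source.update({"c.txt": tx, "d.txt": tx})   # T3: second half written
    second = lister(source, state)              # T4: second run
    return first, second
```

In this model, `simulate(list_timestamp_only)` returns an empty list for the 
second run, which is the missed-files symptom, while 
`simulate(list_with_tracking)` returns the second half of the batch.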



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
