[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882081#comment-15882081
 ] 

Koji Kawamura commented on NIFI-3332:
-------------------------------------

[~jskora] I think the success marker file can be created either by a processor 
in the same flow or by the external program that creates the files to be 
listed, but it is a manual, additional step. If the files are created by a 3rd 
party and you can't modify the behavior of that program, this pattern is 
useless. When you can, it guarantees that all files written to a dir before 
the _SUCCESS file is created are listed.
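
For illustration, the marker pattern could be sketched roughly as below. This is only a minimal sketch in plain Java, not NiFi processor code; the class name SuccessMarkerListing and the method listIfComplete are hypothetical, and it assumes the writer creates _SUCCESS only after every data file is in place.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of NiFi.
final class SuccessMarkerListing {
    static final String MARKER = "_SUCCESS";

    /**
     * Returns the sorted names of regular files in dir, excluding the marker.
     * Returns an empty list while the marker is absent, i.e. while the
     * writer may still be adding files.
     */
    static List<String> listIfComplete(Path dir) throws IOException {
        if (!Files.exists(dir.resolve(MARKER))) {
            return Collections.emptyList();
        }
        List<String> names = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile)
                   .map(p -> p.getFileName().toString())
                   .filter(n -> !n.equals(MARKER))
                   .sorted()
                   .forEach(names::add);
        }
        return names;
    }
}
```

The guarantee holds only if the writer cooperates, which is why the pattern does not help with 3rd party programs.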

To complete NiFi's file listing capabilities, I came up with two additional 
processors, WatchFiles and DiffFiles:

!listfiles.png|width=100%!

h3. DiffFiles

Unlike ListFile, DiffFiles doesn't use managed state; instead it keeps the 
listing in a FlowFile.
It outputs the list of filenames, probably with a set of metadata such as 
timestamp, file size and hash, as the content of a single outgoing FlowFile 
(probably JSON), so that it can handle a large amount of such data.
When it receives an incoming FlowFile, it checks the diffs between the incoming 
list (the previous list) and the currently fetched list, then emits the diffs 
to added, updated and removed accordingly.
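
The diff step could look roughly like the following. It is a sketch under assumptions, not a proposed implementation: the names ListingDiff and Meta are hypothetical, the metadata is reduced to last-modified time and size, and the listings are plain maps rather than JSON FlowFile content.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the DiffFiles comparison; not NiFi code.
final class ListingDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> updated = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();

    /** Per-file metadata; the choice of fields is an assumption. */
    static final class Meta {
        final long lastModified;
        final long size;

        Meta(long lastModified, long size) {
            this.lastModified = lastModified;
            this.size = size;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Meta)) {
                return false;
            }
            Meta m = (Meta) o;
            return lastModified == m.lastModified && size == m.size;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(lastModified) * 31 + Long.hashCode(size);
        }
    }

    /** Compares the previous listing with the current one, keyed by filename. */
    static ListingDiff diff(Map<String, Meta> previous, Map<String, Meta> current) {
        ListingDiff d = new ListingDiff();
        for (Map.Entry<String, Meta> e : current.entrySet()) {
            Meta old = previous.get(e.getKey());
            if (old == null) {
                d.added.add(e.getKey());      // not seen in the previous listing
            } else if (!old.equals(e.getValue())) {
                d.updated.add(e.getKey());    // same name, different metadata
            }
        }
        for (String name : previous.keySet()) {
            if (!current.containsKey(name)) {
                d.removed.add(name);          // gone since the previous listing
            }
        }
        return d;
    }
}
```

Because files are matched by name, a file that shows up later with the same timestamp as already listed files is still reported as added.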

This processor can be used against any filesystem, remote or local. Unlike 
ListFile, it doesn't rely on timestamps.

h3. WatchFiles

Since NiFi is a project with stream processing in mind, we might need a 
WatchDirectory processor for the local file system or any other file system 
that supports a watch API. It would follow the same implementation pattern as 
ConsumeXXX, using the WatchService API available since Java 1.7.
See "Watching a Directory for Changes":
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
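
As a rough sketch of the underlying API (standard java.nio.file calls only, no NiFi processor wiring), registering a directory and draining one batch of events might look like this. The class name DirectoryWatchSketch and both method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; a real processor would keep the WatchService open
// across runs and close it when stopped.
final class DirectoryWatchSketch {

    /** Creates a watcher for create, modify and delete events in dir. */
    static WatchService register(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    /**
     * Waits up to timeoutMillis for a signalled key and returns its events
     * as "KIND:filename" strings. Returns an empty list on timeout.
     */
    static List<String> pollOnce(WatchService watcher, long timeoutMillis)
            throws InterruptedException {
        List<String> events = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return events;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; a real processor would re-list
            }
            events.add(event.kind().name() + ":" + event.context());
        }
        key.reset(); // re-arm the key so later events are delivered
        return events;
    }
}
```

Note that OVERFLOW means events may have been dropped, so a watch-based processor would probably still need a fallback listing.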

Do you think these additional processors would be helpful? I think if these 
are available, the _SUCCESS file would no longer be needed.

> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>         Attachments: listfiles.png, Test-showing-ListFile-timestamp-bug.log, 
> Test-showing-ListFile-timestamp-bug.patch
>
>
> The new state implementation for the ListXXX processors based on 
> AbstractListProcessor creates a race condition when processor runs occur 
> while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a 
> given timestamp.  Without the record of files processed, the remainder of the 
> batch is ignored on the next processor run since their timestamp is not 
> greater than the one timestamp stored in processor state.  With the file 
> tracking it was possible to process files that matched the timestamp exactly 
> and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx 
> is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp 
> in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating 2nd half Tx 
> timestamp batch.
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit 
> test that demonstrates the problem and a log[2] of the output from one such 
> run.  The test writes 3 files each in two batches with processor runs after 
> each batch.  Batch 2 writes files with timestamps older than, equal to, and 
> newer than the timestamp stored when batch 1 was processed, but only the newer 
> file is picked up.  The older file is correctly ignored, but the file with the 
> matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
