Sivaprasanna Sethuraman commented on NIFI-2853:

Although it is not critical, I believe this is an important feature that is 
needed. I also think it is better to append not just the root-level directory 
name to "listing.timestamp" and "emitted.timestamp" but also the 
subdirectories, e.g. "listing.timestamp.dir1.subdir2" and 
"listing.timestamp.dir1.subdir3.subdir3_1", to avoid edge-case scenarios. 
Otherwise, files might not get picked up in some scenarios. For example:
 1. Create a directory "/tmp/sub-dir1"
 2. Create a file "file1.txt" under "/tmp/sub-dir1"
 3. Create a couple of files under "/tmp"
 4. Create another file "file2.txt" under "/tmp/sub-dir1"

Now set the ListHDFS "Directory" property to /tmp/sub-dir1 and run the flow. It 
will set the stored timestamp to that of the latest file, 
"/tmp/sub-dir1/file2.txt". Now change the ListHDFS directory to "/tmp": it 
won't pull in the files that were created in step 3, because those files' 
modification times are earlier than the timestamp stored as part of the 
processor's state. This would not happen with the approach described above. 
Thoughts?
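
To make the scenario concrete, here is a minimal sketch in Python. It is a toy 
model, not NiFi's actual ListHDFS code: the file names "a.txt" and "b.txt", the 
integer timestamps, the helpers list_new and state_key, and the assumption that 
only direct children of the directory are listed are all made up for 
illustration.

```python
import posixpath

# Hypothetical files as (path, modification_time); times follow steps 1-4 above.
FILES = [
    ("/tmp/sub-dir1/file1.txt", 1),
    ("/tmp/a.txt", 2),
    ("/tmp/b.txt", 3),
    ("/tmp/sub-dir1/file2.txt", 4),
]

def list_new(directory, state, key):
    """Return direct children of `directory` newer than state[key], then advance it."""
    last = state.get(key, 0)
    found = [(p, t) for p, t in FILES
             if posixpath.dirname(p) == directory.rstrip("/") and t > last]
    if found:
        state[key] = max(t for _, t in found)
    return [p for p, _ in found]

def state_key(directory):
    """Build a per-directory key such as 'listing.timestamp.tmp.sub-dir1'."""
    parts = [s for s in directory.split("/") if s]
    return ".".join(["listing.timestamp"] + parts)

# Single shared key: switching to /tmp misses the files from step 3.
shared = {}
first = list_new("/tmp/sub-dir1", shared, "listing.timestamp")
missed = list_new("/tmp", shared, "listing.timestamp")

# One key per directory path: the /tmp listing starts from its own timestamp.
per_dir = {}
list_new("/tmp/sub-dir1", per_dir, state_key("/tmp/sub-dir1"))
picked = list_new("/tmp", per_dir, state_key("/tmp"))
```

With the single key, the second listing comes back empty because the stored 
timestamp already sits at 4. With one key per path, the "/tmp" listing returns 
both files from step 3. A real implementation would also have to decide how to 
handle directory names that themselves contain dots.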

> Improve ListHDFS state tracking
> -------------------------------
>                 Key: NIFI-2853
>                 URL: https://issues.apache.org/jira/browse/NIFI-2853
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Bryan Bende
>            Priority: Minor
> Currently ListHDFS tracks two properties in state management, 
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the 
> directory property now supports expression language which means the directory 
> being listed could dynamically change on any execution of the processor. 
> The processor should be changed to store state specific to the directory that 
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario. Currently ListHDFS has to be 
> run on the primary node only; otherwise each node will overwrite the others' 
> state and produce unexpected results. With the above improvement, if the 
> directory evaluated to a unique path for each node, it would store the state 
> for each of those paths separately.
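
The clustered case described in the issue can be sketched the same way. This is 
only an illustration under stated assumptions: the shared state is modelled as 
a plain dict, and the directory name "dir2", the timestamps, and the helper 
record_listing are hypothetical, not part of NiFi.

```python
def record_listing(state, key, timestamp):
    """Store the latest listing timestamp under `key` in the shared state."""
    state[key] = timestamp

# One key for all nodes: the second write replaces the first.
shared = {}
record_listing(shared, "listing.timestamp", 100)  # node 1, listing dir1
record_listing(shared, "listing.timestamp", 50)   # node 2, listing dir2

# One key per directory: each node's timestamp is kept separately.
per_dir = {}
record_listing(per_dir, "listing.timestamp.dir1", 100)  # node 1
record_listing(per_dir, "listing.timestamp.dir2", 50)   # node 2
```

In the first case node 1's timestamp is lost, so its next run would compare 
against node 2's value. In the second case both entries remain in the state.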

This message was sent by Atlassian JIRA
