Sivaprasanna Sethuraman commented on NIFI-2853:
Although this is not critical, I believe it is an important feature. I also
think the "listing.timestamp" and "emitted.timestamp" state keys should carry
not just the root-level directory name but also the subdirectories, for
example "listing.timestamp.dir1.subdir2" and
"listing.timestamp.dir1.subdir3.subdir3_1", to avoid edge cases. Otherwise,
files might not get picked up in some scenarios. For example:
# Create a directory "/tmp/sub-dir1"
# Create a file "file1.txt" under "/tmp/sub-dir1"
# Create a couple of files under "/tmp"
# Create another file "file2.txt" under "/tmp/sub-dir1"
Now set the ListHDFS "Directory" property to "/tmp/sub-dir1" and run the flow.
The processor will set the timestamp to that of the most recently modified
file, which is "/tmp/sub-dir1/file2.txt". Now change the ListHDFS directory to
"/tmp". It won't pull in the files that were created in step 3, because their
modification times are earlier than the timestamp stored in the processor's
state. This would not happen with the approach described above. Thoughts?
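
The scenario above can be sketched as a small simulation. This is only an
illustration, not the actual ListHDFS code: the function names, the integer
timestamps, and the two extra file names under "/tmp" are hypothetical, and it
assumes the listing recurses into subdirectories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsFile:
    path: str
    modified: int  # modification time in arbitrary ticks (hypothetical)


# Steps 1-4 from the comment; a.txt and b.txt are made-up names for step 3.
FILES = [
    HdfsFile("/tmp/sub-dir1/file1.txt", 1),
    HdfsFile("/tmp/a.txt", 2),
    HdfsFile("/tmp/b.txt", 3),
    HdfsFile("/tmp/sub-dir1/file2.txt", 4),
]


def parent_of(path):
    """Return the directory that directly contains the file."""
    return path.rsplit("/", 1)[0] or "/"


def list_with_global_timestamp(files, directory, state):
    """One 'listing.timestamp' shared by every directory (current behavior)."""
    last = state.get("listing.timestamp", -1)
    prefix = directory.rstrip("/") + "/"
    picked = [f for f in files
              if f.path.startswith(prefix) and f.modified > last]
    if picked:
        state["listing.timestamp"] = max(f.modified for f in picked)
    return [f.path for f in picked]


def list_with_per_directory_timestamps(files, directory, state):
    """One timestamp per directory that directly contains the file."""
    prefix = directory.rstrip("/") + "/"
    picked = []
    for f in files:
        key = "listing.timestamp." + parent_of(f.path)
        if f.path.startswith(prefix) and f.modified > state.get(key, -1):
            picked.append(f)
    # Update state only after filtering, so files that share a timestamp
    # within one directory are not skipped.
    for f in picked:
        key = "listing.timestamp." + parent_of(f.path)
        state[key] = max(state.get(key, -1), f.modified)
    return [f.path for f in picked]
```

With a single shared timestamp, listing "/tmp/sub-dir1" and then "/tmp" returns
nothing on the second run. With per-directory timestamps, the second run
returns the two files created directly under "/tmp" and does not re-emit the
files under "/tmp/sub-dir1".
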
> Improve ListHDFS state tracking
> Key: NIFI-2853
> URL: https://issues.apache.org/jira/browse/NIFI-2853
> Project: Apache NiFi
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Bryan Bende
> Priority: Minor
> Currently ListHDFS tracks two properties in state management,
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the
> directory property now supports expression language which means the directory
> being listed could dynamically change on any execution of the processor.
> The processor should be changed to store state specific to the directory that
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be
> run on primary node only, otherwise each node will be overwriting each other's
> state and producing unexpected results. With the above improvement, if the
> directory evaluated to a unique path for each node, it would store the state
> of each of those paths separately.
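
A minimal sketch of how such directory-specific state keys could be derived,
following the "listing.timestamp.dir1.subdir2" naming used in the discussion.
The helper names and the node directories are hypothetical, and a plain dict
stands in for the cluster-wide state map; this is not the NiFi state API.

```python
def state_key(prefix, directory):
    """Build a dotted state key such as 'listing.timestamp.dir1.subdir2'."""
    parts = [p for p in directory.split("/") if p]
    return ".".join([prefix] + parts)


def record_listing(state, directory, timestamp):
    """Store the latest listing timestamp under the directory's own key."""
    key = state_key("listing.timestamp", directory)
    state[key] = max(state.get(key, -1), timestamp)


# Two nodes whose directory expression evaluates to different paths write to
# different keys, so neither overwrites the other's entry.
shared_state = {}
record_listing(shared_state, "/data/node1", 10)  # hypothetical path on node 1
record_listing(shared_state, "/data/node2", 20)  # hypothetical path on node 2
```

One caveat with dotted keys: a directory name that itself contains a dot
(for example "/a.b" versus "/a/b") would map to the same key, so a real
implementation might prefer to use the full path as the key suffix.
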