Joseph Witt commented on NIFI-2853:
If it were very clearly documented, I could see how ListHDFS could be allowed
to run on multiple nodes when each is configured to look at unique directories,
but it also seems like potentially unnecessary complexity.
The idea for ListHDFS was to do something very lightweight from a single node,
and the work could then be farmed out across the cluster for the heavier
lifting, which would be the actual FetchHDFS calls.
By allowing multiple nodes to execute ListHDFS at once, we have to have some
way to namespace the state, and while the directory being listed would do it,
this also means we could end up needing to store state for an arbitrarily
large number of directories. In the single-node case, regardless of how many
directories we're pulling from, it won't matter, because a single value each
for the listing and emitted timestamps is sufficient for all of them (just look
for anything in any matching directory that has changed since that time).
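To illustrate the single-node case, here is a minimal sketch (not the actual
ListHDFS code) of how one timestamp can cover any number of directories. The
class SingleTimestampListing, the FileEntry type, and both method names are
hypothetical, and the real processor tracks a listing and an emitted timestamp
rather than the single value used here.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleTimestampListing {

    /** Hypothetical stand-in for a file found during a listing. */
    public static final class FileEntry {
        public final String directory;
        public final String name;
        public final long modifiedMillis;

        public FileEntry(String directory, String name, long modifiedMillis) {
            this.directory = directory;
            this.name = name;
            this.modifiedMillis = modifiedMillis;
        }
    }

    /** Returns entries from any directory modified after the stored timestamp. */
    public static List<FileEntry> changedSince(List<FileEntry> entries, long lastTimestamp) {
        List<FileEntry> changed = new ArrayList<>();
        for (FileEntry entry : entries) {
            if (entry.modifiedMillis > lastTimestamp) {
                changed.add(entry);
            }
        }
        return changed;
    }

    /** Returns the new timestamp to store: the latest modification time seen so far. */
    public static long advance(List<FileEntry> entries, long lastTimestamp) {
        long latest = lastTimestamp;
        for (FileEntry entry : entries) {
            latest = Math.max(latest, entry.modifiedMillis);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<FileEntry> entries = new ArrayList<>();
        entries.add(new FileEntry("/data/a", "old.txt", 100L));
        entries.add(new FileEntry("/data/b", "new.txt", 300L));
        for (FileEntry entry : changedSince(entries, 200L)) {
            System.out.println(entry.directory + "/" + entry.name);
        }
    }
}
```

The stored state stays one value no matter how many directories are matched,
which is the property that is lost once state is namespaced per directory.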
> Improve ListHDFS state tracking
> Key: NIFI-2853
> URL: https://issues.apache.org/jira/browse/NIFI-2853
> Project: Apache NiFi
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Bryan Bende
> Priority: Minor
> Currently ListHDFS tracks two properties in state management,
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the
> directory property now supports expression language which means the directory
> being listed could dynamically change on any execution of the processor.
> The processor should be changed to store state specific to the directory that
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be
> run on primary node only, otherwise each node will be overwriting each other's
> state and producing unexpected results. With the above improvement, if the
> directory evaluated to a unique path for each node, it would store the state
> of each of those paths separately.
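As a rough sketch of the per-directory keys proposed above, the state could be
namespaced by appending the evaluated directory to each key. The class
PerDirectoryState and its methods are hypothetical and use a plain Map in
place of NiFi's state management; only the two key prefixes come from the
issue text.

```java
import java.util.HashMap;
import java.util.Map;

public class PerDirectoryState {

    // Key prefixes taken from the issue description.
    static final String LISTING_PREFIX = "listing.timestamp.";
    static final String EMITTED_PREFIX = "emitted.timestamp.";

    /** Stores both timestamps under keys specific to the listed directory. */
    public static void record(Map<String, String> state, String directory,
                              long listingTimestamp, long emittedTimestamp) {
        state.put(LISTING_PREFIX + directory, Long.toString(listingTimestamp));
        state.put(EMITTED_PREFIX + directory, Long.toString(emittedTimestamp));
    }

    /** Returns the listing timestamp for a directory, or -1 if none is stored. */
    public static long listingTimestamp(Map<String, String> state, String directory) {
        String value = state.get(LISTING_PREFIX + directory);
        return value == null ? -1L : Long.parseLong(value);
    }

    public static void main(String[] args) {
        Map<String, String> state = new HashMap<>();
        record(state, "dir1", 1000L, 900L);
        record(state, "dir2", 2000L, 1900L);
        // Two entries per distinct directory: state grows with the directory count.
        System.out.println(state.size() + " keys stored");
    }
}
```

Each distinct directory adds two entries, so a directory expression that
evaluates to many different paths makes the stored state grow without bound.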