[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570581#comment-15570581 ]
Joseph Witt commented on NIFI-2853:
-----------------------------------

If very clearly documented, I could see how ListHDFS could be allowed to run on multiple nodes if configured to look at unique directories, but it also seems to be a potentially unnecessary complexity. The idea for ListHDFS was to do something very lightweight from a single node; the heavier work, which would be the actual FetchHDFS calls, could then be farmed out across the cluster. By allowing multiple nodes to execute ListHDFS at once we have to have some way to namespace the state, and while the directory being listed would do it, this also means we could end up needing to store an arbitrarily large number of directories. In the single node case, regardless of how many directories we're pulling from, it won't matter because a single value for the timestamp of listing and emitting is sufficient for all of them (just look for anything in any matching directory that has changed since that time).

> Improve ListHDFS state tracking
> -------------------------------
>
>                 Key: NIFI-2853
>                 URL: https://issues.apache.org/jira/browse/NIFI-2853
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Bryan Bende
>            Priority: Minor
>
> Currently ListHDFS tracks two properties in state management,
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the
> directory property now supports expression language which means the directory
> being listed could dynamically change on any execution of the processor.
> The processor should be changed to store state specific to the directory that
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be
> run on primary node only, otherwise each node will be overwriting each other's
> state and producing unexpected results. With the above improvement, if the
> directory evaluated to a unique path for each node, it would store the state
> of each of those paths separately.
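
For illustration only, the per-directory keys proposed in the issue description could be sketched roughly as below. This is not the ListHDFS implementation: the function names `state_keys` and `record_listing` are hypothetical, a plain dict stands in for the processor's string-to-string state map, and the timestamps are made-up values. Only the key prefixes "listing.timestamp" and "emitted.timestamp" come from the issue text.

```python
# Key prefixes taken from the issue description; everything else is hypothetical.
LISTING_KEY = "listing.timestamp"
EMITTED_KEY = "emitted.timestamp"


def state_keys(directory: str) -> tuple[str, str]:
    """Return the (listing, emitted) state keys namespaced by directory."""
    return f"{LISTING_KEY}.{directory}", f"{EMITTED_KEY}.{directory}"


def record_listing(state: dict[str, str], directory: str,
                   listing_ts: int, emitted_ts: int) -> dict[str, str]:
    """Return a copy of the state map with this directory's timestamps updated."""
    listing_key, emitted_key = state_keys(directory)
    updated = dict(state)
    # Values are stored as strings, mirroring a string-to-string state map.
    updated[listing_key] = str(listing_ts)
    updated[emitted_key] = str(emitted_ts)
    return updated


# Two directories listed in turn (illustrative timestamps).
state: dict[str, str] = {}
state = record_listing(state, "dir1", 1000, 900)
state = record_listing(state, "dir2", 2000, 1900)
```

Each distinct directory adds two entries to the map, which is the growth concern raised in the comment: with expression language in the directory property, the number of stored keys is bounded only by the number of paths the property ever evaluates to.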