Re: Implementation of ListFile's Primary Node only in a cluster

Bryan Bende Sat, 10 Feb 2018 11:30:06 -0800

Currently it means that the dataflow manager/developer is expected to
set the 'Execution Nodes' strategy to "Primary Node" at the time of
flow design.

We don't have anything that restricts the scheduling strategy of a
processor, but we probably should consider having an annotation like
@PrimaryNodeOnly that you can put on a processor and then the
framework will enforce that it can only be scheduled on the primary node.
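
To make the idea concrete, here is a rough sketch of how such an
annotation and the framework-side check could look. Everything in it is
hypothetical: the annotation declaration, the two stand-in processor
classes, the ExecutionNode enum, and effectiveExecutionNode() are names
made up for illustration and are not existing framework code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation, as proposed in this thread.
// RUNTIME retention is needed so a scheduler can see it via reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PrimaryNodeOnly {
}

// Stand-in for a processor that always lists a remote location.
@PrimaryNodeOnly
class AlwaysRemoteListing {
}

// Stand-in for a processor that may list a local directory,
// so it is left unannotated.
class LocalOrRemoteListing {
}

class PrimaryNodeOnlySketch {

    // Hypothetical model of the 'Execution Nodes' setting.
    enum ExecutionNode { ALL_NODES, PRIMARY_NODE }

    // Assumed framework-side check: an annotated processor is forced to
    // the primary node regardless of what the flow designer configured.
    static ExecutionNode effectiveExecutionNode(Class<?> processorClass,
                                                ExecutionNode configured) {
        if (processorClass.isAnnotationPresent(PrimaryNodeOnly.class)) {
            return ExecutionNode.PRIMARY_NODE;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(effectiveExecutionNode(
                AlwaysRemoteListing.class, ExecutionNode.ALL_NODES));
        System.out.println(effectiveExecutionNode(
                LocalOrRemoteListing.class, ExecutionNode.ALL_NODES));
    }
}
```

A real implementation might instead reject the invalid configuration
rather than silently override it; the sketch only shows where the
enforcement would sit.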

In the case of ListFile, I think the statement in the documentation is
only partially true...

When "Input Directory Location" is set to local, there should be no
issue with scheduling the processor on all nodes in the cluster, as it
would be listing a local directory and storing state locally.

When "Input Directory Location" is set to remote, it wouldn't make
sense to have all nodes listing the same remote directory and getting
the same results. In that case the state is also stored in ZooKeeper
under a ZNode named after the processor's UUID, and the processor has
the same UUID on each node, so the nodes would be overwriting each
other's state in ZK.
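
The difference between the two cases can be illustrated with a toy
model. This is only a sketch under simplifying assumptions: plain maps
stand in for the state stores, the key layout is invented for
illustration, and the class and method names are hypothetical rather
than the framework's state API.

```java
import java.util.HashMap;
import java.util.Map;

class SharedStateSketch {

    // Cluster-scoped state: one entry keyed only by the processor UUID,
    // so every node writes to the same slot and the last write wins.
    static Map<String, Long> clusterScoped(String processorUuid,
                                           long[] lastListedPerNode) {
        Map<String, Long> store = new HashMap<>();
        for (long lastListed : lastListedPerNode) {
            store.put(processorUuid, lastListed);
        }
        return store;
    }

    // Local state: modeled here as one entry per node, so nodes never
    // touch each other's listing state.
    static Map<String, Long> localScoped(String processorUuid,
                                         long[] lastListedPerNode) {
        Map<String, Long> store = new HashMap<>();
        for (int node = 0; node < lastListedPerNode.length; node++) {
            store.put("node-" + node + "/" + processorUuid,
                      lastListedPerNode[node]);
        }
        return store;
    }

    public static void main(String[] args) {
        long[] lastListed = {100L, 200L, 300L};
        // Three nodes, but only one surviving entry in the shared store.
        System.out.println(clusterScoped("proc-uuid", lastListed).size());
        // Three nodes, three independent entries.
        System.out.println(localScoped("proc-uuid", lastListed).size());
    }
}
```

In a real cluster the order of the writes is not deterministic, so
which node's state survives in the shared slot would vary from run to
run.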

So ListFile probably can't be restricted to primary node only, whereas
something like ListHDFS probably could, because it is always listing
a remote destination.

On Fri, Feb 9, 2018 at 10:55 PM, Sivaprasanna <sivaprasanna...@gmail.com> wrote:
> I was going through ListFile processor's code and found out that in the
> documentation
> <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ListFile.java#L72-L76>,
> it is mentioned that "this processor is designed to run on Primary Node
> only in a cluster". I want to understand what "designed" stands for here.
> Does that mean the processor was built in a way that it only runs on the
> Primary node regardless of the "Execution Nodes" strategy that is set,
> or does it mean that the dataflow manager/developer is expected to set the
> 'Execution Nodes' strategy to "Primary Node" at the time of flow design? If
> it is the former, how is it handled in the code? If it is handled,
> it should be on the framework side, but I don't see any annotation
> indicating anything related to such a mechanism in the processor code.
> Moreover, a related JIRA, NIFI-543
> <https://issues.apache.org/jira/browse/NIFI-543>, is also open, so I want
> to clear up my doubt.
>
> -
> Sivaprasanna
