Hello Yang,

For a cluster with a "Y" style dataflow, each node will run a copy of the whole flow. This means that at the merging point, only data within a single node will get merged; data held on different nodes is never combined.
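To make that concrete, here is a minimal Python sketch of two nodes each running the full "Y" flow and merging only their own data. This is not NiFi code: `FlowFile`, `run_flow`, the branch transformations, and the attribute names are all made up for illustration, and the only thing taken from NiFi is the idea that a FlowFile is content plus attributes.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    # Hypothetical stand-in for a NiFi FlowFile: content plus attributes.
    content: str
    attributes: dict = field(default_factory=dict)


def run_flow(node_name, source_a, source_b):
    """Run the whole "Y" flow on one node: two branches, then a merge."""
    # Branch A and branch B both execute on this node (illustrative transforms).
    branch_a = [FlowFile(c.upper(), {"branch": "a"}) for c in source_a]
    branch_b = [FlowFile(c.lower(), {"branch": "b"}) for c in source_b]

    # The merge point only sees FlowFiles that live on this node.
    local = branch_a + branch_b
    merged = "|".join(ff.content for ff in local)
    return FlowFile(merged, {"node": node_name, "count": len(local)})


# Each node receives its own input data and runs the same flow independently.
node1 = run_flow("node-1", ["x1", "x2"], ["Y1"])
node2 = run_flow("node-2", ["x3"], ["Y2", "Y3"])

print(node1.content)  # X1|X2|y1
print(node2.content)  # X3|y2|y3
```

Neither node goes idle and neither node owns a branch; each one produces its own merged result from whatever data arrived at it.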
A little bit of a metaphor: say you want to create toys that combine multiple different parts together (input data) and you have two workers (nodes). The way that NiFi would break up the work is to give each worker the blueprints (the data flow) for the entire toy, and each works on the necessary raw materials independently to create their own end product (end merged FlowFile). Raw materials from one worker are never merged with the raw materials of the other; they are worked on independently. NiFi uses the same concept of isolating the work to independent workers.

There is a little wiggle room with re-distributing work to the nodes using Site-to-Site (S2S) and using the "primary node only" scheduling strategy, but those are special cases.

Hope that metaphor helps a bit,
Joe

- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: [email protected]

On Monday, April 25, 2016 6:29 PM, Yuanzhe Yang (杨远哲) <[email protected]> wrote:

Hi Joe,

Thanks a lot for your explanation and suggestion.

As for the clustering question, what I actually want to ask is this: for example, when we have a two-node cluster and a "Y" style dataflow, will the two nodes work on the two branches respectively? If so, what will happen after the result is merged at the intersection processor? Does one node become idle?

Regards,
Yang

> On 25 Apr 2016, at 17:44, Joe Percivall <[email protected]> wrote:
>
> Hello Yang,
>
> To better understand how data flows through NiFi to the processors you need
> to understand FlowFiles. FlowFiles are the data records that get processed by
> the processors. A FlowFile is a pointer to content plus a collection of
> attributes. So each time, the processor acts on the entire FlowFile produced
> by the previous processor.
>
> For clustering, the flow is replicated to each node of the cluster. This
> means each node in the cluster has a copy of the flow which it uses to
> process all data sent to it (except for processors marked as "primary node"
> only, but that's a bit more advanced).
>
> Also, for a better-worded, more in-depth look into NiFi, I would suggest
> checking out the PR for the "NiFi In Depth" doc[1]. It would help answer many
> questions you may have about the internals of NiFi. Also, any comments on it
> are much appreciated.
>
> [1] https://github.com/apache/nifi/pull/339#discussion_r60103526
>
> Joe
>
> - - - - - -
> Joseph Percivall
> linkedin.com/in/Percivall
> e: [email protected]
>
> On Monday, April 25, 2016 11:21 AM, Yuanzhe Yang (杨远哲) <[email protected]> wrote:
>
> Hi,
>
> I have read some documentation about NiFi, but I haven't got a clear
> impression of how data flows inside NiFi. Is it processed in a streaming
> fashion? Or does a processor get the entire intermediate result produced by
> its previous processor? Moreover, what is the granularity of clustering? Is
> it dataflow level or processor level?
>
> Thank you very much for your clarification; your work is very much
> appreciated.
>
> Regards,
> Yang
