Hi Joe,

Thanks a lot for your detailed explanation. That’s clear now :D

Regards,
Yang

> On 26 Apr 2016, at 01:16, Joe Percivall <[email protected]> 
> wrote:
> 
> Hello Yang,
> 
> For a cluster with a "Y" style dataflow, each node will run a copy of 
> the whole flow. This means that at the merging point, only data within a 
> single node will get merged.
> 
> A little bit of a metaphor: say you want to create toys that combine multiple 
> different parts together (input data) and you have two workers (nodes). The 
> way that NiFi would break up the work is to give each worker the blueprints 
> (the data flow) for the entire toy and each works on the necessary raw 
> materials independently to create their own end product (end merged 
> FlowFile). Raw materials from one worker are never merged with the raw 
> materials of the other; they are worked on independently.
> 
> NiFi uses the same concept of isolating the work to independent workers. 
> 
> There is a little wiggle room with redistributing work to the nodes using 
> Site-to-Site (S2S) and with the "primary node only" scheduling strategy, but 
> those are special cases.
> 
> Hope that metaphor helps a bit,
> Joe
> 
> - - - - - -
> Joseph Percivall
> linkedin.com/in/Percivall
> e: [email protected]
> 
> 
> 
> 
> On Monday, April 25, 2016 6:29 PM, Yuanzhe Yang (杨远哲) <[email protected]> 
> wrote:
> Hi Joe,
> 
> Thanks a lot for your explanation and suggestion. As for the clustering 
> question, what I actually want to ask is that, for example, when we have a 
> two node cluster and a “Y” style dataflow, will the two nodes work on the two 
> branches respectively? If so, what will happen after the result is merged at 
> the intersection processor? Does one node become idle? 
> 
> Regards,
> Yang
> 
> 
>> On 25 Apr 2016, at 17:44, Joe Percivall <[email protected]> 
>> wrote:
>> 
>> Hello Yang,
>> 
>> To better understand how data flows through NiFi to the processors, you need 
>> to understand FlowFiles. FlowFiles are the data records that get processed 
>> by the processors. Each FlowFile is a pointer to content plus a collection of 
>> attributes. So each time, a processor acts on the entire FlowFile produced 
>> by the previous processor. 
>> 
>> For clustering, the flow is replicated to each node of the cluster. This 
>> means each node in the cluster has a copy of the flow which it uses to 
>> process all data sent to it (except for processors marked as "primary node" 
>> only, but that's a bit more advanced).
>> 
>> Also, for a better-worded, more in-depth look into NiFi, I would suggest 
>> checking out the PR for the "NiFi In Depth" doc[1]. It would help answer 
>> many questions you may have about the internals of NiFi. Also, any comments 
>> on it are much appreciated.
>> 
>> [1] https://github.com/apache/nifi/pull/339#discussion_r60103526
>> 
>> Joe
>> 
>> - - - - - -
>> Joseph Percivall
>> linkedin.com/in/Percivall
>> e: [email protected]
>> 
>> 
>> 
>> 
>> On Monday, April 25, 2016 11:21 AM, Yuanzhe Yang (杨远哲) <[email protected]> 
>> wrote:
>> Hi,
>> 
>> I have read some documentation about NiFi, but I haven’t got a clear 
>> impression about how data flows inside NiFi. Is it processed as a stream? Or 
>> does a processor get the entire intermediate result produced by its previous 
>> processor? Moreover, what is the granularity of clustering? Is it dataflow 
>> level or processor level?
>> 
>> Thank you very much for your clarification and your work is very much 
>> appreciated.
>> 
>> Regards,
>> Yang
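
For anyone reading this thread later, the per-node behaviour Joe describes can be sketched as a toy model. This is only an illustration under assumptions, not NiFi code: the `FlowFile` class, the `run_y_flow` function, the `count` attribute, and the node names are all hypothetical, and the content is held inline here whereas a real FlowFile only points to its content.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Hypothetical stand-in for a FlowFile: content plus attributes.

    A real FlowFile holds a pointer to its content; this toy version
    keeps the bytes inline to stay self-contained.
    """
    content: bytes
    attributes: dict = field(default_factory=dict)


def run_y_flow(branch_a, branch_b):
    """One node's copy of a "Y" flow: two branches feeding one merge step."""
    incoming = list(branch_a) + list(branch_b)
    merged_content = b"".join(ff.content for ff in incoming)
    # "count" is an illustrative attribute name, not a NiFi attribute.
    return FlowFile(merged_content, {"count": str(len(incoming))})


# Each node only sees the data that was sent to it (hypothetical inputs).
node_inputs = {
    "node-1": ([FlowFile(b"a1")], [FlowFile(b"b1")]),
    "node-2": ([FlowFile(b"a2")], [FlowFile(b"b2")]),
}

# Every node runs the whole flow independently; nothing crosses nodes.
results = {node: run_y_flow(a, b) for node, (a, b) in node_inputs.items()}

for node, merged in results.items():
    print(node, merged.content, merged.attributes)
```

In this sketch both nodes execute both branches and the merge step, so neither sits idle, and each merged result contains only that node's own inputs.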
