Ricky,

So the use case you're describing here is a good and common one:

If I have a data source which does not offer scalability (it can only
send to a single node, for instance) but I have a scalable distribution
cluster, what are my options?

So today you can accept the data on a single node and then immediately do as
Mark describes: fire it to a "Remote Process Group" addressing the
cluster itself.  That way NiFi will automatically figure out all the
nodes in the cluster and spread the data around, factoring in
load, etc.  But we do want to establish an even more automatic
mechanism on a connection itself, where the user can indicate the data
should be auto-balanced.
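To make the idea concrete, the balancing an RPG gives you is conceptually just round-robin over the nodes it discovers in the cluster. Here's an illustrative sketch in plain Python (this is not NiFi's actual implementation, and the node names are made up):

```python
from itertools import cycle

# Hypothetical node list -- in NiFi the RPG discovers these from the cluster.
nodes = ["node-1", "node-2", "node-3"]

def distribute(flowfiles, nodes):
    """Round-robin flowfiles across nodes, the way an RPG spreads load."""
    assignment = {n: [] for n in nodes}
    for node, ff in zip(cycle(nodes), flowfiles):
        assignment[node].append(ff)
    return assignment

# Nine flowfiles land evenly: three per node.
result = distribute(range(9), nodes)
```

A real implementation would also weight by reported node load rather than strict rotation, which is what the RPG does for you automatically.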

The reverse is true as well: you can have a consumer which
only wants to accept from a single host.  So there too we need a
mechanism to funnel the flow back down to a single node.

I realize the flow you're working with now is just a sort of
familiarization thing.  But do you think this is something we should
tackle soon (based on real scenarios you face)?

Thanks
Joe

On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:
> Ricky,
>
>
>
>
> I don’t think there’s a JIRA ticket currently. Feel free to create one.
>
>
>
>
> I think we may need to do a better job documenting how Remote Process 
> Groups work. If you have a cluster setup, you would add a Remote Process Group 
> that points to the Cluster Manager (i.e., the URL that you connect to in 
> order to see the graph).
>
>
> Then, anything that you send to the Remote Process Group will automatically 
> get load-balanced across all of the nodes in the cluster. So you could set up 
> a flow that looks something like:
>
>
> GenerateFlowFile -> RemoteProcessGroup
>
>
> Input Port -> HashContent
>
>
> So these 2 flows are disjoint. The first part generates data and then 
> distributes it to the cluster (when you connect to the Remote Process Group, 
> you choose which Input Port to send to).
>
>
> But what we’d like to do in the future is something like:
>
>
> GenerateFlowFile -> HashContent
>
>
> And then on the connection in the middle choose to auto-distribute the data. 
> Right now you have to put the Remote Process Group in there to distribute to 
> the cluster, and add the Input Port to receive the data. But there should 
> only be a single RemoteProcessGroup that points to the entire cluster, not 
> one per node.
>
>
> Thanks
>
> -Mark
>
>
> From: Ricky Saltzer
> Sent: Friday, February 6, 2015 3:06 PM
> To: [email protected]
>
> Mark -
>
> Thanks for the fast reply, much appreciated. This is what I figured, but
> since I was already in clustered mode, I wanted to make sure there wasn't
> an easier way than adding each node as a remote process group.
>
> Is there already a JIRA to track the ability to auto distribute in
> clustered mode, or would you like me to open it up?
>
> Thanks again,
> Ricky
>
> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
>
>> Ricky,
>>
>>
>> The DistributeLoad processor is simply used to route to one of many
>> relationships. So if you have, for instance, 5 different servers that you
>> can FTP files to, you can use DistributeLoad to round robin the files
>> between them, so that you end up pushing 20% to each of 5 PutFTP processors.
>>
>>
>> What you’re wanting to do, it sounds like, is to distribute the FlowFiles
>> to different nodes in the cluster. The Remote Process Group is how you
>> would need to do that at this time. We have discussed having the ability to
>> mark a Connection as “Auto-Distributed” (or maybe some better name 😊) and
>> have that automatically distribute the data between nodes in the cluster,
>> but that feature hasn’t yet been implemented.
>>
>>
>> Does that answer your question?
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>>
>> From: Ricky Saltzer
>> Sent: Friday, February 6, 2015 2:56 PM
>> To: [email protected]
>>
>>
>>
>>
>>
>> Hi -
>>
>> I have a question regarding load distribution in a clustered NiFi
>> environment. I have a really simple example: I'm using the GenerateFlowFile
>> processor to generate some random data, then I MD5 hash the file and print
>> out the resulting hash.
>>
>> I want only the primary node to generate the data, but I want both nodes in
>> the cluster to share the hashing workload. It appears that if I set the
>> scheduling strategy to "On primary node" for the GenerateFlowFile
>> processor, then the next processor (HashContent) only accepts and
>> processes the data on a single node.
>>
>> I've put a DistributeLoad processor in between the GenerateFlowFile and
>> HashContent, but this requires me to use a remote process group to
>> distribute the load, which doesn't seem intuitive when I'm already
>> clustered.
>>
>> I guess my question is, is it possible for the DistributeLoad processor to
>> understand that NiFi is in a clustered environment, and have an ability to
>> distribute the next processor (HashContent) amongst all nodes in the
>> cluster?
>>
>> Cheers,
>> --
>> Ricky Saltzer
>> http://www.cloudera.com
>>
>
>
>
> --
> Ricky Saltzer
> http://www.cloudera.com
