Ricky,

So the use case you're coming from here is a good and common one, which is:
If I have a data source which does not offer scalability (it can only send
to a single node, for instance) but I have a scalable distribution cluster,
what are my options?

So today you can accept the data on a single node, then immediately do as
Mark describes and fire it to a "Remote Process Group" addressing the
cluster itself. That way NiFi will automatically figure out all the nodes
in the cluster and spread the data around, factoring in load, etc. But we
do want to establish an even more automatic mechanism on a connection
itself, where the user can indicate the data should be auto-balanced.

The reverse is really true as well, where you can have a consumer which
only wants to accept from a single host. So there too we need a mechanism
to funnel the data back to a single node.

I realize the flow you're working with now is just a sort of
familiarization exercise. But do you think this is something we should
tackle soon (based on real scenarios you face)?

Thanks
Joe

On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]> wrote:

> Ricky,
>
> I don't think there's a JIRA ticket currently. Feel free to create one.
>
> I think we may need to do a better job documenting how Remote Process
> Groups work. If you have a cluster setup, you would add a Remote Process
> Group that points to the Cluster Manager (i.e., the URL that you connect
> to in order to see the graph).
>
> Then, anything that you send to the Remote Process Group will
> automatically get load-balanced across all of the nodes in the cluster.
> So you could set up a flow that looks something like:
>
> GenerateFlowFile -> RemoteProcessGroup
>
> Input Port -> HashContent
>
> So these two flows are disjoint. The first part generates data and then
> distributes it to the cluster (when you connect to the Remote Process
> Group, you choose which Input Port to send to).
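The "spread the data around factoring in load" behavior described above can be pictured with a small sketch in plain Python. This is an illustration only, not NiFi code: the node names and the queue model are invented for the example.

```python
# Toy model of load-aware distribution: each FlowFile is sent to the node
# with the smallest pending queue. Node names and the queue model are
# invented for illustration; this is not how NiFi is implemented internally.
queues = {"node-a": [], "node-b": [], "node-c": []}

def send_to_least_loaded(flowfile, queues):
    """Pick the node with the fewest queued FlowFiles and enqueue there."""
    target = min(queues, key=lambda node: len(queues[node]))
    queues[target].append(flowfile)
    return target

for flowfile_id in range(9):
    send_to_least_loaded(flowfile_id, queues)

# With nine FlowFiles and three nodes, each node ends up holding three.
```

The real Site-to-Site mechanism also accounts for node availability, but the basic idea is the same: the sender, not the source system, balances the work.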
> But what we'd like to do in the future is something like:
>
> GenerateFlowFile -> HashContent
>
> And then on the connection in the middle, choose to auto-distribute the
> data. Right now you have to put the Remote Process Group in there to
> distribute to the cluster, and add the Input Port to receive the data.
> But there should only be a single RemoteProcessGroup that points to the
> entire cluster, not one per node.
>
> Thanks
> -Mark
>
> From: Ricky Saltzer
> Sent: Friday, February 6, 2015 3:06 PM
> To: [email protected]
>
> Mark -
>
> Thanks for the fast reply, much appreciated. This is what I figured, but
> since I was already in clustered mode, I wanted to make sure there wasn't
> an easier way than adding each node as a remote process group.
>
> Is there already a JIRA to track the ability to auto-distribute in
> clustered mode, or would you like me to open it up?
>
> Thanks again,
> Ricky
>
> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]> wrote:
>
>> Ricky,
>>
>> The DistributeLoad processor is simply used to route to one of many
>> relationships. So if you have, for instance, 5 different servers that
>> you can FTP files to, you can use DistributeLoad to round-robin the
>> files between them, so that you end up pushing 20% to each of 5 PutFTP
>> processors.
>>
>> What you're wanting to do, it sounds like, is to distribute the
>> FlowFiles to different nodes in the cluster. The Remote Process Group
>> is how you would need to do that at this time. We have discussed having
>> the ability to mark a Connection as "Auto-Distributed" (or maybe some
>> better name 😊) and have that automatically distribute the data between
>> nodes in the cluster, but that feature hasn't yet been implemented.
>>
>> Does that answer your question?
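Mark's description of DistributeLoad's round-robin behavior (20% of the files to each of 5 PutFTP processors) can be sketched in a few lines of plain Python. Illustration only: the relationship names are made up, and this models just the default round-robin strategy.

```python
from itertools import cycle

# Hypothetical relationship names standing in for five PutFTP processors;
# the round-robin distribution strategy is modeled with cycle().
relationships = ["ftp-1", "ftp-2", "ftp-3", "ftp-4", "ftp-5"]

def distribute(flowfiles, targets):
    """Round-robin each FlowFile to the next target in turn."""
    assignments = {name: [] for name in targets}
    next_target = cycle(targets)
    for flowfile in flowfiles:
        assignments[next(next_target)].append(flowfile)
    return assignments

# 100 FlowFiles across 5 relationships: 20 (i.e. 20%) land on each.
result = distribute(range(100), relationships)
```

The key point in the thread is that this routing happens among relationships on one node; it does not, by itself, move data to other nodes in the cluster.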
>>
>> Thanks
>> -Mark
>>
>> From: Ricky Saltzer
>> Sent: Friday, February 6, 2015 2:56 PM
>> To: [email protected]
>>
>> Hi -
>>
>> I have a question regarding load distribution in a clustered NiFi
>> environment. I have a really simple example: I'm using the
>> GenerateFlowFile processor to generate some random data, then I MD5
>> hash the file and print out the resulting hash.
>>
>> I want only the primary node to generate the data, but I want both
>> nodes in the cluster to share the hashing workload. It appears that if
>> I set the scheduling strategy to "On primary node" for the
>> GenerateFlowFile processor, then the next processor (HashContent) only
>> accepts and processes data on a single node.
>>
>> I've put the DistributeLoad processor in between GenerateFlowFile and
>> HashContent, but this requires me to use the Remote Process Group to
>> distribute the load, which doesn't seem intuitive when I'm already
>> clustered.
>>
>> I guess my question is: is it possible for the DistributeLoad processor
>> to understand that NiFi is in a clustered environment, and have the
>> ability to distribute the next processor (HashContent) amongst all
>> nodes in the cluster?
>>
>> Cheers,
>> --
>> Ricky Saltzer
>> http://www.cloudera.com
>
> --
> Ricky Saltzer
> http://www.cloudera.com
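As a side note for anyone reproducing the test flow in this thread outside NiFi: the two processors Ricky describes boil down to generating random bytes and taking an MD5 digest. A plain Python sketch (the function names are stand-ins for illustration, not NiFi APIs):

```python
import hashlib
import os

def generate_flowfile(size=16):
    """Stand-in for GenerateFlowFile: a buffer of random bytes."""
    return os.urandom(size)

def hash_content(data):
    """Stand-in for HashContent configured for MD5: hex digest of content."""
    return hashlib.md5(data).hexdigest()

content = generate_flowfile()
print(hash_content(content))  # always a 32-character hex string
```

The hashing itself is trivial; the whole question in the thread is about which cluster node gets to run it.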
