Peter,

I'm not sure there is a good way for a processor to drive such a thing with the existing infrastructure. Letting a processor know about the structure of the cluster is not something we have wanted to expose, for good reasons. There would likely need to be a more fundamental point of support for this.
I'm not sure what that design would look like just yet, but I agree this is an important step to take soon. If you want to start sketching out design ideas, that would be awesome.

Thanks

On Thu, Jun 7, 2018 at 6:11 PM Peter Wicks (pwicks) <pwi...@micron.com> wrote:
>
> Joe,
>
> I agree it is a lot of work, which is why I was thinking of starting with a
> processor that could do some of these operations before looking further. If
> the processor could move flowfiles between nodes in the cluster it would be
> a good step. Data comes in from a queue on any node, but gets written out to
> a queue on only the desired node; or gets output round robin for a
> distribute scenario.
>
> I want to work on it, and was trying to figure out if it could be done using
> only a processor, or if larger changes would be needed for sure.
>
> --Peter
>
> -----Original Message-----
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Thursday, June 7, 2018 3:34 PM
> To: dev@nifi.apache.org
> Subject: Re: [EXT] Re: Primary Only Content Migration
>
> Peter,
>
> It isn't a pattern that is well supported now in a cluster context.
>
> What is needed are automatically load balanced connections with partitioning.
> This would mean a user could select a given relationship and indicate that
> data should be automatically distributed, and they should be able to express,
> optionally, if there is a correlation attribute that is used for ensuring
> data which belongs together stays together or comes back together. We could
> use this to automatically have a connection result in data being distributed
> across the cluster for load balancing purposes and also ensure that data is
> brought back to a single node whenever necessary, which is the case in certain
> scenarios like fork/distribute/process/join/send and things like distributed
> receipt then join for merging (like defragmenting data which has been split).
> To join them together we need affinity/correlation, and this could work based
> on some sort of hashing mechanism where there are as many buckets as there
> are nodes in a cluster at a given time. It needs a lot of
> thought/design/testing/etc.
>
> I was just having a conversation about this yesterday. It is definitely a
> thing and will be a major effort. Will make a JIRA for this soon.
>
> Thanks
>
> On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwi...@micron.com>
> wrote:
> > Bryan,
> >
> > We see this with large files that we have split up into smaller files and
> > distributed across the cluster using site-to-site. We then want to merge
> > them back together, so we send them to the primary node before continuing
> > processing.
> >
> > --Peter
> >
> > -----Original Message-----
> > From: Bryan Bende [mailto:bbe...@gmail.com]
> > Sent: Thursday, June 7, 2018 12:47 PM
> > To: dev@nifi.apache.org
> > Subject: [EXT] Re: Primary Only Content Migration
> >
> > Peter,
> >
> > There really shouldn't be any non-source processors scheduled for primary
> > node only. We may even want to consider preventing that option when the
> > processor has an incoming connection to avoid creating any confusion.
> >
> > As long as you set source processors to primary node only then everything
> > should be ok... if the primary node changes, the source processor starts
> > executing on the new primary node, and any flow files it already produced
> > on the old primary node will continue to be worked off by the downstream
> > processors on the old node until they are all processed.
> >
> > -Bryan
> >
> >
> >
> > On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwi...@micron.com>
> > wrote:
> >> I'm sure many of you have the same situation, a flow that runs on a
> >> cluster, and at some point merges back down to a primary only processor;
> >> your files sit there in the queue with nowhere to go...
> >> We've used the workaround of having a remote process group that loops the
> >> data back to the primary node for a while, but would really like a
> >> clean/simple solution. This approach requires that users be able to put an
> >> input port on the root flow, and then route the file back down, which is a
> >> nuisance.
> >>
> >> I have been thinking of adding either a processor that moves data between
> >> specific nodes in a cluster, or a queue (?) option that will let users
> >> migrate the content of a flowfile back to the primary node. This would
> >> allow you to move data back to a primary very easily without needing RPGs
> >> and input ports at the root level.
> >>
> >> All of my development work with NiFi has been focused on processors, so
> >> I'm not really sure where I would start with this. Thoughts?
> >>
> >> Thanks,
> >> Peter
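
For anyone sketching design ideas from this thread: the partitioning Joe describes (hash a correlation attribute into as many buckets as there are nodes, otherwise distribute round robin) might look roughly like the following. This is only an illustrative sketch in Python, not NiFi code. The names `pick_node` and `partition`, the dict-of-attributes representation of a flowfile, and the node names are all assumptions made up for the example, and CRC32 is just one stable hash that could fill the "some sort of hashing mechanism" role.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def pick_node(correlation_value: str, nodes: Sequence[str]) -> str:
    """Map a correlation value to one node, using one bucket per node.

    Hypothetical helper: CRC32 is used because it gives the same result in
    every process, unlike Python's built-in hash() for strings.
    """
    if not nodes:
        raise ValueError("no nodes available")
    bucket = zlib.crc32(correlation_value.encode("utf-8")) % len(nodes)
    return nodes[bucket]


def partition(
    flowfiles: Sequence[Dict[str, str]],
    nodes: Sequence[str],
    correlation_attribute: Optional[str] = None,
) -> Dict[str, List[Dict[str, str]]]:
    """Group flowfiles (modelled here as plain attribute dicts) by target node.

    If a correlation attribute is given and present on a flowfile, its value
    is hashed so related flowfiles land on the same node. Otherwise the
    flowfile is assigned round robin by its position in the input.
    """
    if not nodes:
        raise ValueError("no nodes available")
    assignments: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for index, flowfile in enumerate(flowfiles):
        value = flowfile.get(correlation_attribute) if correlation_attribute else None
        if value is not None:
            node = pick_node(value, nodes)
        else:
            node = nodes[index % len(nodes)]
        assignments[node].append(flowfile)
    return dict(assignments)


if __name__ == "__main__":
    cluster = ["node-1", "node-2", "node-3"]  # made-up node names
    fragments = [
        {"fragment.identifier": "file-A", "fragment.index": str(i)}
        for i in range(4)
    ]
    # All fragments of file-A end up in a single node's list.
    print(partition(fragments, cluster, "fragment.identifier"))
    # Without a correlation attribute the same input is spread round robin.
    print(partition(fragments, cluster))
```

One limitation worth noting: because the bucket count equals the current node count, a value can map to a different node after a node joins or leaves, so a real design would need to decide what happens to data already queued under the old mapping.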