Re: [EXT] Re: Primary Only Content Migration

Joe Witt Thu, 07 Jun 2018 16:19:58 -0700

Peter

I'm not sure there is a good way for a processor to drive such a thing
with existing infrastructure.  Giving a processor the ability to know
about the structure of a cluster is not something we have wanted to
expose, for good reasons.  There would likely need to be a more
fundamental point of support for this.

I'm not sure what that design would look like just yet, but I agree
this is an important step to take soon.  If you want to start
sketching out design ideas, that would be awesome.

Thanks

On Thu, Jun 7, 2018 at 6:11 PM Peter Wicks (pwicks) <pwi...@micron.com> wrote:
>
> Joe,
>
> I agree it is a lot of work, which is why I was thinking of starting with a
> processor that could do some of these operations before looking further. If
> the processor could move flowfiles between nodes in the cluster it would be
> a good step. Data comes in from a queue on any node, but gets written out to
> a queue on only the desired node, or gets output round robin for a
> distribute scenario.
>
> I want to work on it, and was trying to figure out if it could be done using 
> only a processor, or if larger changes would be needed for sure.
>
> --Peter
>
> -----Original Message-----
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Thursday, June 7, 2018 3:34 PM
> To: dev@nifi.apache.org
> Subject: Re: [EXT] Re: Primary Only Content Migration
>
> Peter,
>
> It isn't a pattern that is well supported now in a cluster context.
>
> What is needed are automatically load balanced connections with partitioning.
> This would mean a user could select a given relationship and indicate that
> data should be automatically distributed, and they should be able to express,
> optionally, whether there is a correlation attribute used to ensure that
> data which belongs together stays together or comes back together.  We could
> use this to automatically have a connection result in data being distributed
> across the cluster for load balancing purposes, and also ensure that data is
> brought back to a single node whenever necessary, which is the case in certain
> scenarios like fork/distribute/process/join/send and things like distributed
> receipt then join for merging (like defragmenting data which has been split).
> To join them together we need affinity/correlation, and this could work based
> on some sort of hashing mechanism where there are as many buckets as there
> are nodes in a cluster at a given time.  It needs a lot of
> thought/design/testing/etc.
>
> I was just having a conversation about this yesterday.  It is definitely a 
> thing and will be a major effort.  Will make a JIRA for this soon.
>
> Thanks
>
> On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwi...@micron.com> 
> wrote:
> > Bryan,
> >
> > We see this with large files that we have split up into smaller files and 
> > distributed across the cluster using site-to-site. We then want to merge 
> > them back together, so we send them to the primary node before continuing 
> > processing.
> >
> > --Peter
> >
> > -----Original Message-----
> > From: Bryan Bende [mailto:bbe...@gmail.com]
> > Sent: Thursday, June 7, 2018 12:47 PM
> > To: dev@nifi.apache.org
> > Subject: [EXT] Re: Primary Only Content Migration
> >
> > Peter,
> >
> > There really shouldn't be any non-source processors scheduled for primary 
> > node only. We may even want to consider preventing that option when the 
> > processor has an incoming connection to avoid creating any confusion.
> >
> > As long as you set source processors to primary node only then everything 
> > should be ok... if primary node changes, the source processor starts 
> > executing on the new primary node, and any flow files it already produced 
> > on the old primary node will continue to be worked off by the downstream 
> > processors on the old node until they are all processed.
> >
> > -Bryan
> >
> >
> >
> > On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwi...@micron.com> 
> > wrote:
> >> I'm sure many of you have the same situation: a flow that runs on a
> >> cluster, and at some point merges back down to a primary only processor;
> >> your files sit there in the queue with nowhere to go... We've used the
> >> workaround of having a remote processor group that loops the data back to
> >> the primary node for a while, but would really like a clean/simple
> >> solution. This approach requires that users be able to put an input port
> >> on the root flow, and then route the file back down, which is a nuisance.
> >>
> >> I have been thinking of adding either a processor that moves data between
> >> specific nodes in a cluster, or a queue (?) option that will let users
> >> migrate the content of a flowfile back to the primary node. This would
> >> allow you to move data back to a primary very easily without needing RPGs
> >> and input ports at the root level.
> >>
> >> All of my development work with NiFi has been focused on processors, so 
> >> I'm not really sure where I would start with this.  Thoughts?
> >>
> >> Thanks,
> >>   Peter
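
The hashing idea raised earlier in the thread (as many buckets as there are
nodes, with an optional correlation attribute so that related data lands on
the same node) can be sketched roughly as below. This is only an illustration
under stated assumptions, not NiFi code: the class PartitionSketch, the method
nodeFor, and the node identifiers are all hypothetical, and no NiFi API is
used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: picks a target node for a flowfile.
// Not part of NiFi; names and structure are assumptions for illustration.
class PartitionSketch {
    private final List<String> nodes;                   // one bucket per node
    private final AtomicLong counter = new AtomicLong(); // round-robin position

    PartitionSketch(List<String> nodes) {
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = new ArrayList<>(nodes);
    }

    // With a correlation value: hash it into one of nodes.size() buckets, so
    // equal values always map to the same node while the node list is unchanged.
    // Without one (null): distribute round robin for plain load balancing.
    String nodeFor(String correlationValue) {
        if (correlationValue == null) {
            int idx = (int) (counter.getAndIncrement() % nodes.size());
            return nodes.get(idx);
        }
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(correlationValue.hashCode(), nodes.size());
        return nodes.get(bucket);
    }
}
```

Note that with plain modulo hashing the mapping changes whenever the number
of nodes changes, which is one of the points that would need the
thought/design/testing mentioned in the thread.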
