I think the change you made more recently was entirely appropriate. The
right answer here, in my opinion, is to provide a way for users to manage,
view, and understand this at runtime.
Site-to-site is a pretty great feature and we just need to give it a more
first-class treatment:

- Great documentation for users in the app
- Great documentation for the protocol itself, and examples of clients (we
  should likely even help seed the development of a few for popular
  languages)
- A good user experience at runtime to modify and understand what is
  happening, etc.

Also, sorry for my unbelievably unreadable e-mail on this thread earlier. I
really should never send e-mails from my phone.

Thanks
Joe

On Sun, Feb 8, 2015 at 1:17 PM, Mark Payne <[email protected]> wrote:

> Originally, I set the default in the properties file so that site-to-site
> is configured to be secure but not enabled. I did this because I didn’t
> want to enable it as non-secure by default, because I was afraid that this
> would be dangerous… so I required that users explicitly go in and set it
> up. But thinking back on this, maybe that was a mistake. We set the
> default UI port to be 8080 and non-secure, so maybe we should just set the
> default so that site-to-site is enabled non-secure as well. That would
> probably make this a lot easier.
>
> From: Ricky Saltzer
> Sent: Sunday, February 8, 2015 1:16 PM
> To: [email protected]
>
> Thanks for the tip, Mark! Allowing the user to enable the site-to-site
> feature during runtime would be a good step in the right direction.
> Documentation on how it works and why it's different from having your
> nodes in a cluster would also make things easier to understand.
>
> On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[email protected]> wrote:
>
>> Site-to-site is a powerhouse feature but has caused a good bit of
>> confusion. Perhaps we should plan for its inclusion in the things that
>> can be tuned/set at runtime.
>>
>> It would be good to include with that information about bounded
>> interfaces, information about what messages will get sent, etc.
>> Folks in proxy-type situations have a hard time reasoning over what is
>> happening. There is little sense of "is this thing on".
>>
>> What do you all think?
>>
>> On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:
>>
>>> Ricky,
>>>
>>> In the nifi.properties file, there's a property named
>>> "nifi.remote.input.port". By default it's empty. Set that to whatever
>>> port you want to use for site-to-site. Additionally, you'll need to
>>> either set "nifi.remote.input.secure" to false or configure keystore and
>>> truststore properties. Configure this for the nodes and the NCM. After
>>> that you should be good to go (after a restart)!
>>>
>>> If you run into any issues, let us know.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:
>>>>
>>>> Hey Joe -
>>>>
>>>> This makes sense, and I'm in the process of trying it out now. I'm
>>>> running into a small issue where the remote process group is saying
>>>> neither of the nodes is configured for site-to-site communication.
>>>>
>>>> Although not super intuitive, sending to the remote process group
>>>> pointing to the cluster should be fine as long as it works (which I'm
>>>> sure it does).
>>>>
>>>> Ricky
>>>>
>>>>> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:
>>>>>
>>>>> Ricky,
>>>>>
>>>>> So the use case you're coming from here is a good and common one,
>>>>> which is:
>>>>>
>>>>> If I have a data source which does not offer scalability (it can only
>>>>> send to a single node, for instance) but I have a scalable
>>>>> distribution cluster, what are my options?
>>>>>
>>>>> So today you can accept the data on a single node, then immediately do
>>>>> as Mark describes and fire it to a "Remote Process Group" addressing
>>>>> the cluster itself.
>>>>> That way NiFi will automatically figure out all the nodes in the
>>>>> cluster and spread the data around, factoring in load, etc. But we do
>>>>> want to establish an even more automatic mechanism on a connection
>>>>> itself, where the user can indicate the data should be auto-balanced.
>>>>>
>>>>> The reverse is really true as well, where you can have a consumer
>>>>> which only wants to accept from a single host. So there, too, we need
>>>>> a mechanism to descale the approach.
>>>>>
>>>>> I realize the flow you're working with now is just a sort of
>>>>> familiarization thing. But do you think this is something we should
>>>>> tackle soon (based on real scenarios you face)?
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Ricky,
>>>>>>
>>>>>> I don’t think there’s a JIRA ticket currently. Feel free to create
>>>>>> one.
>>>>>>
>>>>>> I think we may need to do a better job documenting how the Remote
>>>>>> Process Groups work. If you have a cluster setup, you would add a
>>>>>> Remote Process Group that points to the Cluster Manager (i.e., the
>>>>>> URL that you connect to in order to see the graph).
>>>>>>
>>>>>> Then, anything that you send to the Remote Process Group will
>>>>>> automatically get load-balanced across all of the nodes in the
>>>>>> cluster. So you could set up a flow that looks something like:
>>>>>>
>>>>>> GenerateFlowFile -> RemoteProcessGroup
>>>>>>
>>>>>> Input Port -> HashContent
>>>>>>
>>>>>> So these two flows are disjoint. The first part generates data and
>>>>>> then distributes it to the cluster (when you connect to the Remote
>>>>>> Process Group, you choose which Input Port to send to).
>>>>>>
>>>>>> But what we’d like to do in the future is something like:
>>>>>>
>>>>>> GenerateFlowFile -> HashContent
>>>>>>
>>>>>> And then, on the connection in the middle, choose to auto-distribute
>>>>>> the data. Right now you have to put the Remote Process Group in there
>>>>>> to distribute to the cluster, and add the Input Port to receive the
>>>>>> data. But there should only be a single RemoteProcessGroup that
>>>>>> points to the entire cluster, not one per node.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Mark
>>>>>>
>>>>>> From: Ricky Saltzer
>>>>>> Sent: Friday, February 6, 2015 3:06 PM
>>>>>> To: [email protected]
>>>>>>
>>>>>> Mark -
>>>>>>
>>>>>> Thanks for the fast reply, much appreciated. This is what I figured,
>>>>>> but since I was already in clustered mode, I wanted to make sure
>>>>>> there wasn't an easier way than adding each node as a remote process
>>>>>> group.
>>>>>>
>>>>>> Is there already a JIRA to track the ability to auto-distribute in
>>>>>> clustered mode, or would you like me to open it up?
>>>>>>
>>>>>> Thanks again,
>>>>>> Ricky
>>>>>>
>>>>>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Ricky,
>>>>>>>
>>>>>>> The DistributeLoad processor is simply used to route to one of many
>>>>>>> relationships. So if you have, for instance, 5 different servers
>>>>>>> that you can FTP files to, you can use DistributeLoad to round-robin
>>>>>>> the files between them, so that you end up pushing 20% to each of 5
>>>>>>> PutFTP processors.
>>>>>>>
>>>>>>> What you’re wanting to do, it sounds like, is to distribute the
>>>>>>> FlowFiles to different nodes in the cluster.
>>>>>>> The Remote Process Group is how you would need to do that at this
>>>>>>> time. We have discussed having the ability to mark a Connection as
>>>>>>> “Auto-Distributed” (or maybe some better name 😊) and have that
>>>>>>> automatically distribute the data between nodes in the cluster, but
>>>>>>> that feature hasn’t yet been implemented.
>>>>>>>
>>>>>>> Does that answer your question?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -Mark
>>>>>>>
>>>>>>> From: Ricky Saltzer
>>>>>>> Sent: Friday, February 6, 2015 2:56 PM
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> I have a question regarding load distribution in a clustered NiFi
>>>>>>> environment. I have a really simple example: I'm using the
>>>>>>> GenerateFlowFile processor to generate some random data, then I MD5
>>>>>>> hash the file and print out the resulting hash.
>>>>>>>
>>>>>>> I want only the primary node to generate the data, but I want both
>>>>>>> nodes in the cluster to share the hashing workload. It appears that
>>>>>>> if I set the scheduling strategy to "On primary node" for the
>>>>>>> GenerateFlowFile processor, then the data for the next processor
>>>>>>> (HashContent) is only accepted and processed by a single node.
>>>>>>>
>>>>>>> I've put a DistributeLoad processor in between the HashContent and
>>>>>>> GenerateFlowFile, but this requires me to use the remote process
>>>>>>> group to distribute the load, which doesn't seem intuitive when I'm
>>>>>>> already clustered.
>>>>>>>
>>>>>>> I guess my question is: is it possible for the DistributeLoad
>>>>>>> processor to understand that NiFi is in a clustered environment, and
>>>>>>> to have an ability to distribute the next processor (HashContent)
>>>>>>> amongst all nodes in the cluster?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> --
>>>>>>> Ricky Saltzer
>>>>>>> http://www.cloudera.com
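
[Editor's note] For readers following Mark's configuration instructions in
this thread, the relevant nifi.properties fragment looks roughly like the
sketch below. It uses only the property names given in the thread; the port
value is an arbitrary example, and the commented keystore/truststore lines
are illustrative placeholders for the secure option:

```properties
# Site-to-site input port; empty by default, which leaves site-to-site disabled.
# 10000 is just an example value.
nifi.remote.input.port=10000

# Option 1: run site-to-site non-secure.
nifi.remote.input.secure=false

# Option 2: leave site-to-site secure and configure keystore/truststore
# properties instead (paths below are hypothetical placeholders):
# nifi.security.keystore=/path/to/keystore.jks
# nifi.security.truststore=/path/to/truststore.jks
```

As Mark notes, this needs to be configured on both the nodes and the NCM,
and takes effect after a restart.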
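
[Editor's note] The round-robin behavior Mark attributes to DistributeLoad
(5 relationships each receiving 20% of the files) can be sketched in a few
lines. This is an illustration of the strategy only, not NiFi code; all
names are made up for the example:

```python
from itertools import cycle

def round_robin(flowfiles, relationships):
    """Pair each flow file with the next relationship in turn, so that
    N relationships each receive roughly 1/N of the files."""
    targets = cycle(relationships)
    return [(ff, next(targets)) for ff in flowfiles]

# Five files across five hypothetical PutFTP relationships -> 20% each,
# matching the example in Mark's reply.
assignments = round_robin(["f1", "f2", "f3", "f4", "f5"],
                          ["ftp1", "ftp2", "ftp3", "ftp4", "ftp5"])
```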
