I think the change you made more recently was entirely appropriate. The
right answer here, in my opinion, is to provide a way for users to manage,
view, and understand this at runtime.
Site-to-site is a pretty great feature and we just need to give it a more
first-class treatment:

- Great documentation for users in the app
- Great documentation for the protocol itself, and examples of clients (we
  should likely even help seed the development of a few for popular
  languages)
- A good user experience at runtime to modify and understand what is
  happening, etc.

Also, sorry for my unbelievably unreadable e-mail on this thread earlier. I
really should never send e-mails from my phone.

Thanks
Joe

On Sun, Feb 8, 2015 at 1:17 PM, Mark Payne <[email protected]> wrote:

> Originally, I set the default in the properties file so that site-to-site
> is configured to be secure but not enabled. I did this because I didn’t
> want to enable it as non-secure by default, because I was afraid that this
> would be dangerous… so I required that users explicitly go in and set it
> up. But thinking back on this, maybe that was a mistake. We set the
> default UI port to be 8080 and non-secure, so maybe we should just set the
> default so that site-to-site is enabled non-secure as well. That would
> probably make this a lot easier.
>
> From: Ricky Saltzer
> Sent: Sunday, February 8, 2015 1:16 PM
> To: [email protected]
>
> Thanks for the tip, Mark! Allowing the user to enable the site-to-site
> feature during runtime would be a good step in the right direction.
> Documentation on how it works and why it's different from having your
> nodes in a cluster would also make things easier to understand.
>
> On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[email protected]> wrote:
>
>> Site-to-site is a powerhouse feature but has caused a good bit of
>> confusion. Perhaps we should plan for its inclusion in the things that
>> can be tuned/set at runtime.
>>
>> It would be good to include with that information about bounded
>> interfaces, information about what messages will get sent, etc.
>> Folks in proxy-type situations have a hard time reasoning over what is
>> happening. There is little sense of "is this thing on".
>>
>> What do you all think?
>>
>> On Feb 8, 2015 6:51 AM, "Mark Payne" <[email protected]> wrote:
>>
>>> Ricky,
>>>
>>> In the nifi.properties file, there's a property named
>>> "nifi.remote.input.port". By default it's empty. Set that to whatever
>>> port you want to use for site-to-site. Additionally, you'll need to
>>> either set "nifi.remote.input.secure" to false or configure keystore and
>>> truststore properties. Configure this for the nodes and the NCM. After
>>> that you should be good to go (after a restart)!
>>>
>>> If you run into any issues, let us know.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[email protected]> wrote:
>>>>
>>>> Hey Joe -
>>>>
>>>> This makes sense, and I'm in the process of trying it out now. I'm
>>>> running into a small issue where the remote process group is saying
>>>> neither of the nodes is configured for site-to-site communication.
>>>>
>>>> Although not super intuitive, sending to the remote process group
>>>> pointing to the cluster should be fine as long as it works (which I'm
>>>> sure it does).
>>>>
>>>> Ricky
>>>>
>>>>> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[email protected]> wrote:
>>>>>
>>>>> Ricky,
>>>>>
>>>>> So the use case you're coming from here is a good and common one,
>>>>> which is:
>>>>>
>>>>> If I have a data source which does not offer scalability (it can only
>>>>> send to a single node, for instance) but I have a scalable
>>>>> distribution cluster, what are my options?
>>>>>
>>>>> So today you can accept the data on a single node, then immediately do
>>>>> as Mark describes and fire it to a "Remote Process Group" addressing
>>>>> the cluster itself.
>>>>> That way NiFi will automatically figure out all the nodes in the
>>>>> cluster and spread the data around, factoring in load, etc. But we do
>>>>> want to establish an even more automatic mechanism on a connection
>>>>> itself, where the user can indicate the data should be auto-balanced.
>>>>>
>>>>> The reverse is really true as well, where you can have a consumer
>>>>> which only wants to accept from a single host. So there, too, we need
>>>>> a mechanism to descale the approach.
>>>>>
>>>>> I realize the flow you're working with now is just a sort of
>>>>> familiarization thing. But do you think this is something we should
>>>>> tackle soon (based on real scenarios you face)?
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Ricky,
>>>>>>
>>>>>> I don’t think there’s a JIRA ticket currently. Feel free to create
>>>>>> one.
>>>>>>
>>>>>> I think we may need to do a better job documenting how the Remote
>>>>>> Process Groups work. If you have a cluster setup, you would add a
>>>>>> Remote Process Group that points to the Cluster Manager (i.e., the
>>>>>> URL that you connect to in order to see the graph).
>>>>>>
>>>>>> Then, anything that you send to the Remote Process Group will
>>>>>> automatically get load-balanced across all of the nodes in the
>>>>>> cluster. So you could set up a flow that looks something like:
>>>>>>
>>>>>> GenerateFlowFile -> RemoteProcessGroup
>>>>>>
>>>>>> Input Port -> HashContent
>>>>>>
>>>>>> So these two flows are disjoint. The first part generates data and
>>>>>> then distributes it to the cluster (when you connect to the Remote
>>>>>> Process Group, you choose which Input Port to send to).
>>>>>>
>>>>>> But what we’d like to do in the future is something like:
>>>>>>
>>>>>> GenerateFlowFile -> HashContent
>>>>>>
>>>>>> And then, on the connection in the middle, choose to auto-distribute
>>>>>> the data. Right now you have to put the Remote Process Group in there
>>>>>> to distribute to the cluster, and add the Input Port to receive the
>>>>>> data. But there should only be a single RemoteProcessGroup that
>>>>>> points to the entire cluster, not one per node.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Mark
>>>>>>
>>>>>> From: Ricky Saltzer
>>>>>> Sent: Friday, February 6, 2015 3:06 PM
>>>>>> To: [email protected]
>>>>>>
>>>>>> Mark -
>>>>>>
>>>>>> Thanks for the fast reply, much appreciated. This is what I figured,
>>>>>> but since I was already in clustered mode, I wanted to make sure
>>>>>> there wasn't an easier way than adding each node as a remote process
>>>>>> group.
>>>>>>
>>>>>> Is there already a JIRA to track the ability to auto-distribute in
>>>>>> clustered mode, or would you like me to open it up?
>>>>>>
>>>>>> Thanks again,
>>>>>> Ricky
>>>>>>
>>>>>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Ricky,
>>>>>>>
>>>>>>> The DistributeLoad processor is simply used to route to one of many
>>>>>>> relationships. So if you have, for instance, 5 different servers
>>>>>>> that you can FTP files to, you can use DistributeLoad to round-robin
>>>>>>> the files between them, so that you end up pushing 20% to each of 5
>>>>>>> PutFTP processors.
>>>>>>>
>>>>>>> What you’re wanting to do, it sounds like, is to distribute the
>>>>>>> FlowFiles to different nodes in the cluster.
>>>>>>> The Remote Process Group is how you would need to do that at this
>>>>>>> time. We have discussed having the ability to mark a Connection as
>>>>>>> “Auto-Distributed” (or maybe some better name 😊) and have that
>>>>>>> automatically distribute the data between nodes in the cluster, but
>>>>>>> that feature hasn’t yet been implemented.
>>>>>>>
>>>>>>> Does that answer your question?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -Mark
>>>>>>>
>>>>>>> From: Ricky Saltzer
>>>>>>> Sent: Friday, February 6, 2015 2:56 PM
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> I have a question regarding load distribution in a clustered NiFi
>>>>>>> environment. I have a really simple example: I'm using the
>>>>>>> GenerateFlowFile processor to generate some random data, then I MD5
>>>>>>> hash the file and print out the resulting hash.
>>>>>>>
>>>>>>> I want only the primary node to generate the data, but I want both
>>>>>>> nodes in the cluster to share the hashing workload. It appears that
>>>>>>> if I set the scheduling strategy to "On primary node" for the
>>>>>>> GenerateFlowFile processor, then the data for the next processor
>>>>>>> (HashContent) is only accepted and processed by a single node.
>>>>>>>
>>>>>>> I've put a DistributeLoad processor in between the HashContent and
>>>>>>> GenerateFlowFile, but this requires me to use the remote process
>>>>>>> group to distribute the load, which doesn't seem intuitive when I'm
>>>>>>> already clustered.
>>>>>>>
>>>>>>> I guess my question is: is it possible for the DistributeLoad
>>>>>>> processor to understand that NiFi is in a clustered environment, and
>>>>>>> to have an ability to distribute the next processor (HashContent)
>>>>>>> amongst all nodes in the cluster?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> --
>>>>>>> Ricky Saltzer
>>>>>>> http://www.cloudera.com
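
[Editor's note] For readers following Mark's configuration instructions in
this thread, the relevant nifi.properties fragment looks roughly like the
sketch below. It uses only the property names given in the thread; the port
value is an arbitrary example, and the commented keystore/truststore lines
are illustrative placeholders for the secure option:

```properties
# Site-to-site input port; empty by default, which leaves site-to-site disabled.
# 10000 is just an example value.
nifi.remote.input.port=10000

# Option 1: run site-to-site non-secure.
nifi.remote.input.secure=false

# Option 2: leave site-to-site secure and configure keystore/truststore
# properties instead (paths below are hypothetical placeholders):
# nifi.security.keystore=/path/to/keystore.jks
# nifi.security.truststore=/path/to/truststore.jks
```

As Mark notes, this needs to be configured on both the nodes and the NCM,
and takes effect after a restart.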
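
[Editor's note] The round-robin behavior Mark attributes to DistributeLoad
(5 relationships each receiving 20% of the files) can be sketched in a few
lines. This is an illustration of the strategy only, not NiFi code; all
names are made up for the example:

```python
from itertools import cycle

def round_robin(flowfiles, relationships):
    """Pair each flow file with the next relationship in turn, so that
    N relationships each receive roughly 1/N of the files."""
    targets = cycle(relationships)
    return [(ff, next(targets)) for ff in flowfiles]

# Five files across five hypothetical PutFTP relationships -> 20% each,
# matching the example in Mark's reply.
assignments = round_robin(["f1", "f2", "f3", "f4", "f5"],
                          ["ftp1", "ftp2", "ftp3", "ftp4", "ftp5"])
```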
