Joe,

Some settings to try if these issues occur again:

In nifi.properties:
  nifi.cluster.node.read.timeout=30 sec

In zookeeper.properties:
  tickTime=5000

Also try switching your Remote Process Group transport protocol from HTTP to RAW, and set the Communications Timeout to something like 1 min.
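For reference, here is roughly how those entries look in context (a sketch, not your exact files; nifi.cluster.node.connection.timeout is a related knob you may also want to raise, and tickTime is in milliseconds, so 5000 means a 5-second ZooKeeper heartbeat unit):

  # nifi.properties -- how long a node waits on replicated cluster requests
  nifi.cluster.node.connection.timeout=30 sec
  nifi.cluster.node.read.timeout=30 sec

  # zookeeper.properties (embedded ZK) -- base time unit for session heartbeats
  tickTime=5000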
On Mon, Feb 13, 2017 at 12:21 PM Joe Gresock <[email protected]> wrote:

> The disk utilization is currently 90-95% used by system and user, and
> iowait is very low. We do use site-to-site.
>
> Interestingly, I can no longer replicate the problem, which is good but
> puzzling. Since the problem first started, I have externalized the ZK
> quorum and decreased the scheduled threads for some processors.
>
> On Mon, Feb 13, 2017 at 5:15 PM, Jeff <[email protected]> wrote:
>
> > Hello Joe,
> >
> > What is the disk utilization on the nodes of your cluster while you're
> > having issues with using the UI?
> >
> > I have done some testing under heavy disk utilization and have had to
> > increase the timeout values for cluster communication to prevent
> > replication requests from timing out. Does your flow use Site-to-Site?
> >
> > On Mon, Feb 13, 2017 at 11:43 AM Joe Gresock <[email protected]> wrote:
> >
> > > "Can you tell us more about the processors using cluster scoped state
> > > and what the rates are through them?"
> > >
> > > In this case it's probably not relevant, because I have that
> > > processor stopped. However, it's a custom MongoDB processor that
> > > stores the last Mongo ID in cluster-scoped state, to enable scrolling
> > > through Mongo results. When it's enabled, it updates the state about
> > > 500 times per minute.
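> > > To give a sense of the access pattern, the state updates are
> > > essentially the following (a trimmed-down sketch of the idea, not the
> > > exact processor code; the key name is made up):
> > >
> > >   // Sketch: how the last Mongo ID gets written to cluster-scoped
> > >   // state via the standard nifi-api StateManager.
> > >   import java.io.IOException;
> > >   import java.util.HashMap;
> > >   import java.util.Map;
> > >
> > >   import org.apache.nifi.components.state.Scope;
> > >   import org.apache.nifi.components.state.StateManager;
> > >   import org.apache.nifi.components.state.StateMap;
> > >   import org.apache.nifi.processor.ProcessContext;
> > >
> > >   class LastIdStateSketch {
> > >       // Called from onTrigger after each page of results.
> > >       void updateLastId(final ProcessContext context, final String lastId)
> > >               throws IOException {
> > >           final StateManager stateManager = context.getStateManager();
> > >           // Read the current cluster-wide state and copy it.
> > >           final StateMap current = stateManager.getState(Scope.CLUSTER);
> > >           final Map<String, String> newState = new HashMap<>(current.toMap());
> > >           newState.put("last.mongo.id", lastId); // illustrative key name
> > >           // With Scope.CLUSTER, each call here is a ZooKeeper write.
> > >           stateManager.setState(newState, Scope.CLUSTER);
> > >       }
> > >   }
> > >
> > > Since the scope is CLUSTER, each of those ~500 updates per minute
> > > turns into ZooKeeper traffic, so I take your point about strain on ZK.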
> > > Some other observations, though: I've been able to manually throttle
> > > the data rate by slowly enabling more processors while reducing their
> > > scheduled thread count. So far I've noticed that the issue is more
> > > likely to occur when the CPUs are maxed out for a while, though that's
> > > not particularly surprising. I've also noticed that prior to each time
> > > the console becomes unreachable, I tend to start seeing
> > > ThreadPoolRequestReplicator exceptions in the logs.
> > >
> > > I have also noticed that as my cluster has been draining flow files
> > > (it started at around 3 million queued up), it's taking longer and
> > > longer to get into the bad state. Not sure if this is related, though,
> > > or if I've just lightened the CPU load by decreasing the scheduled
> > > threads.
> > >
> > > On Mon, Feb 13, 2017 at 4:25 PM, Joe Witt <[email protected]> wrote:
> > >
> > > > Joe,
> > > >
> > > > Can you tell us more about the processors using cluster scoped
> > > > state and what the rates are through them?
> > > >
> > > > I could envision us putting too much strain on ZK in some cases.
> > > >
> > > > Thanks,
> > > > Joe
> > > >
> > > > On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock <[email protected]>
> > > > wrote:
> > > >
> > > > > I was able to externalize my ZooKeeper quorum, which is now
> > > > > running on 3 separate VMs. I am able to bring up the NiFi cluster
> > > > > when my data flow is stopped, and I can tell the ZK migration
> > > > > worked because I have some processors with cluster-scoped state.
> > > > >
> > > > > However, I am still having a hard time getting the console to
> > > > > stay up, with the same error messages from my original post.
> > > > >
> > > > > I also noticed the following error that I was wondering about:
> > > > >
> > > > > ThreadPoolRequestReplicator: Cannot replicate request GET
> > > > > /nifi-api/site-to-site because there are 100 outstanding HTTP
> > > > > Requests already. Request Counts per URI =
> > > > > {/nifi-api/site-to-site=100}.
> > > > >
> > > > > I'm wondering if this is the underlying problem, though I don't
> > > > > know why it would happen only during high data volume, because I
> > > > > am currently not using site-to-site when I let the data run. I
> > > > > have several self-RPG connections in the flow, but they are not
> > > > > being actively used when I process the data at the moment.
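> > > > > For what it's worth, that limit of 100 appears to correspond to
> > > > > a nifi.properties entry that I haven't changed from its default
> > > > > (assuming a standard install here):
> > > > >
> > > > >   # max replicated web requests a node will keep in flight
> > > > >   nifi.cluster.node.max.concurrent.requests=100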
> > > > > Interestingly, I am able to run a custom processor that stores
> > > > > records in MongoDB without issue, but as soon as I run a
> > > > > RouteOnAttribute processor as well, the console goes down again.
> > > > >
> > > > > Any other thoughts?
> > > > >
> > > > > On Fri, Feb 10, 2017 at 1:29 PM, Andrew Grande
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Joe,
> > > > > >
> > > > > > External ZK quorum would be my first move. And make sure those
> > > > > > boxes have fast disks and no heavy load from other processes.
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > On Fri, Feb 10, 2017, 7:23 AM Joe Gresock <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I should add that the flows on the individual nodes appear to
> > > > > > > be processing the data just fine, and the solution I've found
> > > > > > > so far is to just wait for the data to subside, after which
> > > > > > > point the console comes up successfully. So, no complaint
> > > > > > > about the durability of the underlying data flows. It's just
> > > > > > > problematic that I can't reliably make changes to the flow
> > > > > > > during high traffic periods.
> > > > > > >
> > > > > > > On Fri, Feb 10, 2017 at 12:00 PM, Joe Gresock
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > We have a 7-node cluster, and we currently use the embedded
> > > > > > > > ZooKeepers on 3 of the nodes. I've noticed that when we
> > > > > > > > have high volume in our flow (which causes the CPU to be
> > > > > > > > hit pretty hard), I have a really hard time getting the
> > > > > > > > console page to come up, as it cycles through the following
> > > > > > > > error messages when I reload the page:
> > > > > > > >
> > > > > > > > - An unexpected error has occurred. Please check the logs.
> > > > > > > >   (There is never any error in the logs for this one.)
> > > > > > > > - Could not replicate request to <hostname> because the
> > > > > > > >   node is not connected. (This is never the current host
> > > > > > > >   I'm trying to hit, which makes the error text feel a bit
> > > > > > > >   irrelevant to the user, i.e., "I wasn't trying to
> > > > > > > >   replicate a request to that node, I just want to load the
> > > > > > > >   console on this node.")
> > > > > > > > - An error occurred communicating with the application
> > > > > > > >   core. Please check the logs and fix any configuration
> > > > > > > >   issues before restarting. (Again, I can't find any errors
> > > > > > > >   in nifi-app.log or nifi-user.log.)
> > > > > > > >
> > > > > > > > I can go about a half-hour reloading the page before it
> > > > > > > > comes up once, and then I can only get maybe one action in
> > > > > > > > before it auto-refreshes and shows me one of the above
> > > > > > > > error messages again.
> > > > > > > >
> > > > > > > > My first thought was that using some external ZooKeeper
> > > > > > > > servers would improve this, but that's just a hunch. Has
> > > > > > > > anyone encountered this behavior with high data volume?
> > > > > > > >
> > > > > > > > Joe
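> > > > > > > > P.S. For context, the embedded ZooKeeper on those 3 nodes
> > > > > > > > is just the stock setup (assuming defaults here), enabled
> > > > > > > > per node in nifi.properties:
> > > > > > > >
> > > > > > > >   # run an embedded ZooKeeper server alongside NiFi
> > > > > > > >   nifi.state.management.embedded.zookeeper.start=true
> > > > > > > >   # config file for the embedded ZK server
> > > > > > > >   nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties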
