The disk utilization is currently 90-95% used by system and user, and iowait is very low. We do use site-to-site.
Interestingly, I can no longer replicate the problem, which is good but puzzling. Since the problem first started, I have externalized the ZK quorum and decreased the scheduled threads for some processors.

On Mon, Feb 13, 2017 at 5:15 PM, Jeff <[email protected]> wrote:

> Hello Joe,
>
> What is the disk utilization on the nodes of your cluster while you're
> having issues with using the UI?
>
> I have done some testing under heavy disk utilization and have had to
> increase the timeout values for cluster communication to prevent
> replication requests from timing out. Does your flow use Site-to-Site?
>
> On Mon, Feb 13, 2017 at 11:43 AM Joe Gresock <[email protected]> wrote:
>
> > "Can you tell us more about the processors using cluster scoped state and
> > what the rates are through them?"
> >
> > In this case it's probably not relevant, because I have that processor
> > stopped. However, it's a custom MongoDB processor that stores the last
> > mongo ID in the cluster scoped state, to enable scrolling through mongo
> > results. When it's enabled, it updates the state about 500 times /
> > minute.
> >
> > Some other observations, though.. I've been able to manually throttle the
> > data rate by slowly enabling more processors, while reducing their
> > schedule thread count. So far I've noticed that the issue is more likely
> > to occur when the CPUs are maxed out for a while, though that's not
> > particularly surprising. I've noticed that prior to each time the console
> > becomes unreachable, I tend to start seeing ThreadPoolRequestReplicator
> > exceptions in the logs.
> >
> > I have also noticed that as my cluster has been draining flow files
> > (started at around 3 million queued up), it's taking longer and longer to
> > get into the bad state. Not sure if this is related though, or if I've
> > just lightened the CPU load by decreasing the scheduled threads.
> >
> > On Mon, Feb 13, 2017 at 4:25 PM, Joe Witt <[email protected]> wrote:
> >
> > > Joe
> > >
> > > Can you tell us more about the processors using cluster scoped state
> > > and what the rates are through them?
> > >
> > > I could envision us putting too much strain on zk in some cases.
> > >
> > > Thanks
> > > Joe
> > >
> > > On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock <[email protected]> wrote:
> > > > I was able to externalize my zookeeper quorum, which is now running
> > > > on 3 separate VMs. I am able to bring up the nifi cluster when my
> > > > data flow is stopped, and I can tell the zk migration worked because
> > > > I have some processors with cluster-scoped state.
> > > >
> > > > However, I am still having a hard time getting the console to stay
> > > > up, with the same error messages from my original post.
> > > >
> > > > I also noticed the following error that I was wondering about:
> > > >
> > > > ThreadPoolRequestReplicator: Cannot replicate request GET
> > > > /nifi-api/site-to-site because there are 100 outstanding HTTP
> > > > Requests already. Request Counts per URI =
> > > > {/nifi-api/site-to-site=100}.
> > > >
> > > > I'm wondering if this is the underlying problem, though I don't know
> > > > why it would happen only during a high data volume, because I am
> > > > currently not using site-to-site when I let the data run. I have
> > > > several self-RPG connections in the flow, but they are not being
> > > > actively used when I process the data at the moment.
> > > >
> > > > Interestingly, I am able to run a custom processor that stores
> > > > records in MongoDB without issue, but as soon as I run a
> > > > RouteOnAttribute processor as well, the console goes down again.
> > > >
> > > > Any other thoughts?
> > > >
> > > > On Fri, Feb 10, 2017 at 1:29 PM, Andrew Grande <[email protected]> wrote:
> > > >
> > > >> Joe,
> > > >>
> > > >> External ZK quorum would be my first move.
> > > >> And make sure those boxes have
> > > >> fast disks and no heavy load from other processes.
> > > >>
> > > >> Andrew
> > > >>
> > > >> On Fri, Feb 10, 2017, 7:23 AM Joe Gresock <[email protected]> wrote:
> > > >>
> > > >> > I should add that the flows on the individual nodes appear to be
> > > >> > processing the data just fine, and the solution I've found so far
> > > >> > is to just wait for the data to subside, after which point the
> > > >> > console comes up successfully. So, no complaint on the durability
> > > >> > of the underlying data flows. It's just problematic that I can't
> > > >> > reliably make changes to the flow during high traffic periods.
> > > >> >
> > > >> > On Fri, Feb 10, 2017 at 12:00 PM, Joe Gresock <[email protected]> wrote:
> > > >> >
> > > >> > > We have a 7-node cluster and we currently use the embedded
> > > >> > > zookeepers on 3 of the nodes. I've noticed that when we have a
> > > >> > > high volume in our flow (which is causing the CPU to be hit
> > > >> > > pretty hard), I have a really hard time getting the console page
> > > >> > > to come up, as it cycles through the following error messages
> > > >> > > when I reload the page:
> > > >> > >
> > > >> > > - An unexpected error has occurred. Please check the logs.
> > > >> > >   (there is never any error in the logs for this one)
> > > >> > > - Could not replicate request to <hostname> because the node is
> > > >> > >   not connected (this is never the current host I'm trying to
> > > >> > >   hit, which makes the error text feel a bit irrelevant to the
> > > >> > >   user. i.e., "I wasn't trying to replicate a request to that
> > > >> > >   node, I just want to load the console on this node")
> > > >> > > - An error occurred communicating with the application core.
> > > >> > >   Please check the logs and fix any configuration issues before
> > > >> > >   restarting. (Again, can't find any errors in nifi-app.log or
> > > >> > >   nifi-user.log)
> > > >> > >
> > > >> > > I can go about a half-hour reloading the page before it comes up
> > > >> > > once, and then I can only get maybe one action in before it
> > > >> > > auto-refreshes and shows me one of the above error messages
> > > >> > > again.
> > > >> > >
> > > >> > > My first thought was that using some external zookeeper servers
> > > >> > > would improve this, but that's just a hunch. Has anyone
> > > >> > > encountered this behavior with high data volume?
> > > >> > > Joe
> > > >> > >
> > > >> > > --
> > > >> > > I know what it is to be in need, and I know what it is to have
> > > >> > > plenty. I have learned the secret of being content in any and
> > > >> > > every situation, whether well fed or hungry, whether living in
> > > >> > > plenty or in want. I can do all this through him who gives me
> > > >> > > strength. *-Philippians 4:12-13*
> > > >> > >
> > > >> >
> > > >> > --
> > > >> > I know what it is to be in need, and I know what it is to have
> > > >> > plenty. I have learned the secret of being content in any and
> > > >> > every situation, whether well fed or hungry, whether living in
> > > >> > plenty or in want. I can do all this through him who gives me
> > > >> > strength. *-Philippians 4:12-13*
> > > >> >
> > > >>
> > > >
> > > > --
> > > > I know what it is to be in need, and I know what it is to have
> > > > plenty. I have learned the secret of being content in any and every
> > > > situation, whether well fed or hungry, whether living in plenty or
> > > > in want. I can do all this through him who gives me strength.
> > > > *-Philippians 4:12-13*
> > > >
> > >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty. I
> > have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want. I can
> > do all this through him who gives me strength. *-Philippians 4:12-13*
> >
>

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
