Hello Joe,

What is the disk utilization on the nodes of your cluster while you're
having issues with the UI?

I have done some testing under heavy disk utilization and have had to
increase the timeout values for cluster communication to prevent
replication requests from timing out. Does your flow use Site-to-Site?
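For reference, the settings I raised are in nifi.properties on each node.
The defaults are 5 sec; the values below are only what worked for me under
heavy disk load, not a recommendation:

    # nifi.properties (per node): how long request replication waits on a
    # peer before timing out
    nifi.cluster.node.connection.timeout=30 sec
    nifi.cluster.node.read.timeout=30 sec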
On Mon, Feb 13, 2017 at 11:43 AM Joe Gresock <[email protected]> wrote:

> "Can you tell us more about the processors using cluster scoped state
> and what the rates are through them?"
>
> In this case it's probably not relevant, because I have that processor
> stopped. However, it's a custom MongoDB processor that stores the last
> Mongo ID in the cluster-scoped state, to enable scrolling through Mongo
> results. When it's enabled, it updates the state about 500 times per
> minute.
>
> Some other observations, though: I've been able to manually throttle the
> data rate by slowly enabling more processors while reducing their
> scheduled thread count. So far I've noticed that the issue is more
> likely to occur when the CPUs are maxed out for a while, though that's
> not particularly surprising. I've noticed that prior to each time the
> console becomes unreachable, I tend to start seeing
> ThreadPoolRequestReplicator exceptions in the logs.
>
> I have also noticed that as my cluster has been draining flow files (it
> started at around 3 million queued up), it's taking longer and longer to
> get into the bad state. I'm not sure if this is related, though, or if
> I've just lightened the CPU load by decreasing the scheduled threads.
>
> On Mon, Feb 13, 2017 at 4:25 PM, Joe Witt <[email protected]> wrote:
>
> > Joe,
> >
> > Can you tell us more about the processors using cluster scoped state
> > and what the rates are through them?
> >
> > I could envision us putting too much strain on ZooKeeper in some
> > cases.
> >
> > Thanks,
> > Joe
> >
> > On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock <[email protected]> wrote:
> >
> > > I was able to externalize my ZooKeeper quorum, which is now running
> > > on 3 separate VMs. I am able to bring up the NiFi cluster when my
> > > data flow is stopped, and I can tell the ZooKeeper migration worked
> > > because I have some processors with cluster-scoped state.
> > >
> > > However, I am still having a hard time getting the console to stay
> > > up, with the same error messages from my original post.
> > >
> > > I also noticed the following error that I was wondering about:
> > >
> > > ThreadPoolRequestReplicator: Cannot replicate request GET
> > > /nifi-api/site-to-site because there are 100 outstanding HTTP
> > > Requests already. Request Counts per URI =
> > > {/nifi-api/site-to-site=100}.
> > >
> > > I'm wondering if this is the underlying problem, though I don't know
> > > why it would happen only during high data volume, because I am
> > > currently not using Site-to-Site when I let the data run. I have
> > > several self-RPG connections in the flow, but they are not being
> > > actively used when I process the data at the moment.
> > >
> > > Interestingly, I am able to run a custom processor that stores
> > > records in MongoDB without issue, but as soon as I run a
> > > RouteOnAttribute processor as well, the console goes down again.
> > >
> > > Any other thoughts?
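A side note on the cluster-scoped state while I'm quoting it: every
setState() at Scope.CLUSTER is a write through ZooKeeper, so ~500 updates a
minute is real load on the quorum, and I believe the self-RPGs still poll
/nifi-api/site-to-site periodically to refresh peers even when nothing is
transmitting, which could be behind those 100 outstanding requests. If
losing a handful of IDs on restart is tolerable, batching the state writes
helps a lot. A rough sketch of what I mean (the field names, key, and batch
size are made up for illustration, not taken from your processor):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.nifi.components.state.Scope;
    import org.apache.nifi.processor.ProcessContext;

    // Inside the processor class: flush the last-seen Mongo ID to
    // cluster-scoped state every 100th update instead of on every record.
    private static final String LAST_ID_KEY = "last.mongo.id"; // illustrative key
    private final AtomicLong updatesSinceLastFlush = new AtomicLong();

    private void recordLastId(final ProcessContext context, final String lastId)
            throws IOException {
        // Skip the ZooKeeper round trip for 99 of every 100 updates.
        if (updatesSinceLastFlush.incrementAndGet() % 100 != 0) {
            return;
        }
        // setState(..., Scope.CLUSTER) writes through to the ZooKeeper quorum.
        context.getStateManager().setState(
                Collections.singletonMap(LAST_ID_KEY, lastId), Scope.CLUSTER);
    }

The trade-off is that a node restart can re-read up to a batch worth of
Mongo results; whether that's acceptable depends on the flow.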
> > > On Fri, Feb 10, 2017 at 1:29 PM, Andrew Grande <[email protected]> wrote:
> > >
> > > > Joe,
> > > >
> > > > An external ZK quorum would be my first move. And make sure those
> > > > boxes have fast disks and no heavy load from other processes.
> > > >
> > > > Andrew
> > > >
> > > > On Fri, Feb 10, 2017, 7:23 AM Joe Gresock <[email protected]> wrote:
> > > >
> > > > > I should add that the flows on the individual nodes appear to be
> > > > > processing the data just fine, and the solution I've found so far
> > > > > is to just wait for the data to subside, after which point the
> > > > > console comes up successfully. So, no complaint about the
> > > > > durability of the underlying data flows. It's just problematic
> > > > > that I can't reliably make changes to the flow during
> > > > > high-traffic periods.
> > > > >
> > > > > On Fri, Feb 10, 2017 at 12:00 PM, Joe Gresock <[email protected]> wrote:
> > > > >
> > > > > > We have a 7-node cluster, and we currently use the embedded
> > > > > > ZooKeepers on 3 of the nodes. I've noticed that when we have a
> > > > > > high volume in our flow (which is causing the CPU to be hit
> > > > > > pretty hard), I have a really hard time getting the console
> > > > > > page to come up, as it cycles through the following error
> > > > > > messages when I reload the page:
> > > > > >
> > > > > > - An unexpected error has occurred. Please check the logs.
> > > > > > (There is never any error in the logs for this one.)
> > > > > > - Could not replicate request to <hostname> because the node is
> > > > > > not connected. (This is never the current host I'm trying to
> > > > > > hit, which makes the error text feel a bit irrelevant to the
> > > > > > user, i.e., "I wasn't trying to replicate a request to that
> > > > > > node, I just want to load the console on this node.")
> > > > > > - An error occurred communicating with the application core.
> > > > > > Please check the logs and fix any configuration issues before
> > > > > > restarting. (Again, I can't find any errors in nifi-app.log or
> > > > > > nifi-user.log.)
> > > > > >
> > > > > > I can go about a half-hour reloading the page before it comes
> > > > > > up once, and then I can only get maybe one action in before it
> > > > > > auto-refreshes and shows me one of the above error messages
> > > > > > again.
> > > > > >
> > > > > > My first thought was that using some external ZooKeeper servers
> > > > > > would improve this, but that's just a hunch. Has anyone
> > > > > > encountered this behavior with high data volume?
> > > > > >
> > > > > > Joe
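And since external ZooKeeper keeps coming up: for anyone who finds this
thread later, moving off the embedded ZooKeepers touches two files on each
NiFi node. The host names below are placeholders for your own quorum:

    # nifi.properties: stop launching the embedded ZooKeeper and point
    # NiFi at the external quorum
    nifi.state.management.embedded.zookeeper.start=false
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

    <!-- state-management.xml: the cluster state provider needs the same quorum -->
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>

Joe already has this working, so the above is just for completeness. Per
Andrew's point, it also matters that the ZooKeeper boxes have fast disks: a
slow ZooKeeper transaction log will stall every state update NiFi makes.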
