Re: Dealing with cluster errors

Joe Gresock Mon, 13 Feb 2017 09:21:15 -0800

The disk utilization is currently 90-95% used by system and user, and
iowait is very low.  We do use site-to-site.


Interestingly, I can no longer replicate the problem, which is good but
puzzling.  Since the problem first started, I have externalized the ZK
quorum and decreased the scheduled threads for some processors.
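
For reference, the changes discussed in this thread map onto a few entries
in nifi.properties.  The fragment below is only a sketch: the ZooKeeper
hostnames and the 30 sec timeouts are placeholder assumptions, not values
taken from this cluster.

```properties
# Stop starting the embedded ZooKeeper on the NiFi nodes
nifi.state.management.embedded.zookeeper.start=false

# Point NiFi at the external quorum (hostnames are hypothetical)
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

# Cluster communication timeouts (the shipped default is 5 sec).
# 30 sec is an arbitrary example for nodes under heavy load.
nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec
```

The cluster state provider in conf/state-management.xml has its own
Connect String, which would need to point at the same quorum.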

On Mon, Feb 13, 2017 at 5:15 PM, Jeff <[email protected]> wrote:

> Hello Joe,
>
> What is the disk utilization on the nodes of your cluster while you're
> having issues with using the UI?
>
> I have done some testing under heavy disk utilization and have had to
> increase the timeout values for cluster communication to prevent
> replication requests from timing out.  Does your flow use Site-to-Site?
>
> On Mon, Feb 13, 2017 at 11:43 AM Joe Gresock <[email protected]> wrote:
>
> > "Can you tell us more about the processors using cluster scoped state and
> > what the rates are through them?"
> >
> > In this case it's probably not relevant, because I have that processor
> > stopped.  However, it's a custom MongoDB processor that stores the last
> > mongo ID in the cluster scoped state, to enable scrolling through mongo
> > results.  When it's enabled, it updates the state about  500 times /
> > minute.
> >
> > Some other observations, though.. I've been able to manually throttle the
> > data rate by slowly enabling more processors, while reducing their
> > schedule thread count.  So far I've noticed that the issue is more likely
> > to occur when the CPUs are maxed out for a while, though that's not
> > particularly surprising.  I've noticed that prior to each time the console
> > becomes unreachable, I tend to start seeing ThreadPoolRequestReplicator
> > exceptions in the logs.
> >
> > I have also noticed that as my cluster has been draining flow files
> > (started at around 3 million queued up), it's taking longer and longer to
> > get into the bad state.  Not sure if this is related though, or if I've
> > just lightened the CPU load by decreasing the scheduled threads.
> >
> > On Mon, Feb 13, 2017 at 4:25 PM, Joe Witt <[email protected]> wrote:
> >
> > > Joe
> > >
> > > Can you tell us more about the processors using cluster scoped state
> > > and what the rates are through them?
> > >
> > > I could envision us putting too much strain on zk in some cases.
> > >
> > > Thanks
> > > Joe
> > >
> > > On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock <[email protected]> wrote:
> > > > I was able to externalize my zookeeper quorum, which is now running on
> > > > 3 separate VMs.  I am able to bring up the nifi cluster when my data
> > > > flow is stopped, and I can tell the zk migration worked because I have
> > > > some processors with cluster-scoped state.
> > > >
> > > > However, I am still having a hard time getting the console to stay up,
> > > > with the same error messages from my original post.
> > > >
> > > > I also noticed the following error that I was wondering about:
> > > >
> > > > ThreadPoolRequestReplicator: Cannot replicate request GET
> > > > /nifi-api/site-to-site because there are 100 outstanding HTTP Requests
> > > > already.  Request Counts per URI = {/nifi-api/site-to-site=100}.
> > > >
> > > > I'm wondering if this is the underlying problem, though I don't know
> > > > why it would happen only during a high data volume, because I am
> > > > currently not using site-to-site when I let the data run.  I have
> > > > several self-RPG connections in the flow, but they are not being
> > > > actively used when I process the data at the moment.
> > > >
> > > > Interestingly, I am able to run a custom processor that stores records
> > > > in MongoDB without issue, but as soon as I run a RouteOnAttribute
> > > > processor as well, the console goes down again.
> > > >
> > > > Any other thoughts?
> > > >
> > > > On Fri, Feb 10, 2017 at 1:29 PM, Andrew Grande <[email protected]> wrote:
> > > > >
> > > > >> Joe,
> > > > >>
> > > > >> External ZK quorum would be my first move. And make sure those boxes
> > > > >> have fast disks and no heavy load from other processes.
> > > >>
> > > >> Andrew
> > > >>
> > > > >> On Fri, Feb 10, 2017, 7:23 AM Joe Gresock <[email protected]> wrote:
> > > > >>
> > > > >> > I should add that the flows on the individual nodes appear to be
> > > > >> > processing the data just fine, and the solution I've found so far
> > > > >> > is to just wait for the data to subside, after which point the
> > > > >> > console comes up successfully.  So, no complaint on the durability
> > > > >> > of the underlying data flows.  It's just problematic that I can't
> > > > >> > reliably make changes to the flow during high traffic periods.
> > > > >> >
> > > > >> > On Fri, Feb 10, 2017 at 12:00 PM, Joe Gresock <[email protected]> wrote:
> > > >> >
> > > > >> > > We have a 7-node cluster and we currently use the embedded
> > > > >> > > zookeepers on 3 of the nodes.  I've noticed that when we have a
> > > > >> > > high volume in our flow (which is causing the CPU to be hit
> > > > >> > > pretty hard), I have a really hard time getting the console page
> > > > >> > > to come up, as it cycles through the following error messages
> > > > >> > > when I reload the page:
> > > > >> > >
> > > > >> > >    - An unexpected error has occurred.  Please check the logs.
> > > > >> > >    (there is never any error in the logs for this one)
> > > > >> > >    - Could not replicate request to <hostname> because the node
> > > > >> > >    is not connected   (this is never the current host I'm trying
> > > > >> > >    to hit, which makes the error text feel a bit irrelevant to
> > > > >> > >    the user.  i.e., "I wasn't trying to replicate a request to
> > > > >> > >    that node, I just want to load the console on this node")
> > > > >> > >    - An error occurred communicating with the application core.
> > > > >> > >    Please check the logs and fix any configuration issues before
> > > > >> > >    restarting.  (Again, can't find any errors in nifi-app.log or
> > > > >> > >    nifi-user.log)
> > > > >> > >
> > > > >> > > I can go about a half-hour reloading the page before it comes up
> > > > >> > > once, and then I can only get maybe one action in before it
> > > > >> > > auto-refreshes and shows me one of the above error messages
> > > > >> > > again.
> > > > >> > >
> > > > >> > > My first thought was that using some external zookeeper servers
> > > > >> > > would improve this, but that's just a hunch.  Has anyone
> > > > >> > > encountered this behavior with high data volume?
> > > >> > > Joe
> > > >> > >
> > > >> > > --
> > > >> > > I know what it is to be in need, and I know what it is to have
> > > >> plenty.  I
> > > >> > > have learned the secret of being content in any and every
> > situation,
> > > >> > > whether well fed or hungry, whether living in plenty or in want.
> > I
> > > can
> > > >> > > do all this through him who gives me strength.    *-Philippians
> > > >> 4:12-13*
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > I know what it is to be in need, and I know what it is to have
> > > plenty.  I
> > > >> > have learned the secret of being content in any and every
> situation,
> > > >> > whether well fed or hungry, whether living in plenty or in want.
> I
> > > can
> > > >> do
> > > >> > all this through him who gives me strength.    *-Philippians
> > 4:12-13*
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > I know what it is to be in need, and I know what it is to have
> > plenty.  I
> > > > have learned the secret of being content in any and every situation,
> > > > whether well fed or hungry, whether living in plenty or in want.  I
> can
> > > do
> > > > all this through him who gives me strength.    *-Philippians 4:12-13*
> > >
> >
> >
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*
