Joe,

Some settings to try if these issues occur again:

In nifi.properties:
  nifi.cluster.node.read.timeout=30 sec

In zookeeper.properties:
  tickTime=5000

Also try switching your Remote Process Group transport protocol from HTTP to RAW, and set the Communications Timeout to something like 1 min.
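For reference, here is roughly how those entries look in context (a sketch, not your exact files; nifi.cluster.node.connection.timeout is a related knob you may also want to raise, and tickTime is in milliseconds, so 5000 means a 5-second ZooKeeper heartbeat unit):

  # nifi.properties -- how long a node waits on replicated cluster requests
  nifi.cluster.node.connection.timeout=30 sec
  nifi.cluster.node.read.timeout=30 sec

  # zookeeper.properties (embedded ZK) -- base time unit for session heartbeats
  tickTime=5000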
On Mon, Feb 13, 2017 at 12:21 PM Joe Gresock <[email protected]> wrote:

> The disk utilization is currently 90-95% used by system and user, and
> iowait is very low. We do use site-to-site.
>
> Interestingly, I can no longer replicate the problem, which is good but
> puzzling. Since the problem first started, I have externalized the ZK
> quorum and decreased the scheduled threads for some processors.
>
> On Mon, Feb 13, 2017 at 5:15 PM, Jeff <[email protected]> wrote:
>
> > Hello Joe,
> >
> > What is the disk utilization on the nodes of your cluster while you're
> > having issues with using the UI?
> >
> > I have done some testing under heavy disk utilization and have had to
> > increase the timeout values for cluster communication to prevent
> > replication requests from timing out. Does your flow use Site-to-Site?
> >
> > On Mon, Feb 13, 2017 at 11:43 AM Joe Gresock <[email protected]> wrote:
> >
> > > "Can you tell us more about the processors using cluster scoped state
> > > and what the rates are through them?"
> > >
> > > In this case it's probably not relevant, because I have that
> > > processor stopped. However, it's a custom MongoDB processor that
> > > stores the last Mongo ID in cluster-scoped state, to enable scrolling
> > > through Mongo results. When it's enabled, it updates the state about
> > > 500 times per minute.
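> > > To give a sense of the access pattern, the state updates are
> > > essentially the following (a trimmed-down sketch of the idea, not the
> > > exact processor code; the key name is made up):
> > >
> > >   // Sketch: how the last Mongo ID gets written to cluster-scoped
> > >   // state via the standard nifi-api StateManager.
> > >   import java.io.IOException;
> > >   import java.util.HashMap;
> > >   import java.util.Map;
> > >
> > >   import org.apache.nifi.components.state.Scope;
> > >   import org.apache.nifi.components.state.StateManager;
> > >   import org.apache.nifi.components.state.StateMap;
> > >   import org.apache.nifi.processor.ProcessContext;
> > >
> > >   class LastIdStateSketch {
> > >       // Called from onTrigger after each page of results.
> > >       void updateLastId(final ProcessContext context, final String lastId)
> > >               throws IOException {
> > >           final StateManager stateManager = context.getStateManager();
> > >           // Read the current cluster-wide state and copy it.
> > >           final StateMap current = stateManager.getState(Scope.CLUSTER);
> > >           final Map<String, String> newState = new HashMap<>(current.toMap());
> > >           newState.put("last.mongo.id", lastId); // illustrative key name
> > >           // With Scope.CLUSTER, each call here is a ZooKeeper write.
> > >           stateManager.setState(newState, Scope.CLUSTER);
> > >       }
> > >   }
> > >
> > > Since the scope is CLUSTER, each of those ~500 updates per minute
> > > turns into ZooKeeper traffic, so I take your point about strain on ZK.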
> > > Some other observations, though: I've been able to manually throttle
> > > the data rate by slowly enabling more processors while reducing their
> > > scheduled thread count. So far I've noticed that the issue is more
> > > likely to occur when the CPUs are maxed out for a while, though that's
> > > not particularly surprising. I've also noticed that prior to each time
> > > the console becomes unreachable, I tend to start seeing
> > > ThreadPoolRequestReplicator exceptions in the logs.
> > >
> > > I have also noticed that as my cluster has been draining flow files
> > > (it started at around 3 million queued up), it's taking longer and
> > > longer to get into the bad state. Not sure if this is related, though,
> > > or if I've just lightened the CPU load by decreasing the scheduled
> > > threads.
> > >
> > > On Mon, Feb 13, 2017 at 4:25 PM, Joe Witt <[email protected]> wrote:
> > >
> > > > Joe,
> > > >
> > > > Can you tell us more about the processors using cluster scoped
> > > > state and what the rates are through them?
> > > >
> > > > I could envision us putting too much strain on ZK in some cases.
> > > >
> > > > Thanks,
> > > > Joe
> > > >
> > > > On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock <[email protected]>
> > > > wrote:
> > > >
> > > > > I was able to externalize my ZooKeeper quorum, which is now
> > > > > running on 3 separate VMs. I am able to bring up the NiFi cluster
> > > > > when my data flow is stopped, and I can tell the ZK migration
> > > > > worked because I have some processors with cluster-scoped state.
> > > > >
> > > > > However, I am still having a hard time getting the console to
> > > > > stay up, with the same error messages from my original post.
> > > > >
> > > > > I also noticed the following error that I was wondering about:
> > > > >
> > > > > ThreadPoolRequestReplicator: Cannot replicate request GET
> > > > > /nifi-api/site-to-site because there are 100 outstanding HTTP
> > > > > Requests already. Request Counts per URI =
> > > > > {/nifi-api/site-to-site=100}.
> > > > >
> > > > > I'm wondering if this is the underlying problem, though I don't
> > > > > know why it would happen only during high data volume, because I
> > > > > am currently not using site-to-site when I let the data run. I
> > > > > have several self-RPG connections in the flow, but they are not
> > > > > being actively used when I process the data at the moment.
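> > > > > For what it's worth, that limit of 100 appears to correspond to
> > > > > a nifi.properties entry that I haven't changed from its default
> > > > > (assuming a standard install here):
> > > > >
> > > > >   # max replicated web requests a node will keep in flight
> > > > >   nifi.cluster.node.max.concurrent.requests=100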
> > > > > Interestingly, I am able to run a custom processor that stores
> > > > > records in MongoDB without issue, but as soon as I run a
> > > > > RouteOnAttribute processor as well, the console goes down again.
> > > > >
> > > > > Any other thoughts?
> > > > >
> > > > > On Fri, Feb 10, 2017 at 1:29 PM, Andrew Grande
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Joe,
> > > > > >
> > > > > > External ZK quorum would be my first move. And make sure those
> > > > > > boxes have fast disks and no heavy load from other processes.
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > On Fri, Feb 10, 2017, 7:23 AM Joe Gresock <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I should add that the flows on the individual nodes appear to
> > > > > > > be processing the data just fine, and the solution I've found
> > > > > > > so far is to just wait for the data to subside, after which
> > > > > > > point the console comes up successfully. So, no complaint
> > > > > > > about the durability of the underlying data flows. It's just
> > > > > > > problematic that I can't reliably make changes to the flow
> > > > > > > during high traffic periods.
> > > > > > >
> > > > > > > On Fri, Feb 10, 2017 at 12:00 PM, Joe Gresock
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > We have a 7-node cluster, and we currently use the embedded
> > > > > > > > ZooKeepers on 3 of the nodes. I've noticed that when we
> > > > > > > > have high volume in our flow (which causes the CPU to be
> > > > > > > > hit pretty hard), I have a really hard time getting the
> > > > > > > > console page to come up, as it cycles through the following
> > > > > > > > error messages when I reload the page:
> > > > > > > >
> > > > > > > > - An unexpected error has occurred. Please check the logs.
> > > > > > > >   (There is never any error in the logs for this one.)
> > > > > > > > - Could not replicate request to <hostname> because the
> > > > > > > >   node is not connected. (This is never the current host
> > > > > > > >   I'm trying to hit, which makes the error text feel a bit
> > > > > > > >   irrelevant to the user, i.e., "I wasn't trying to
> > > > > > > >   replicate a request to that node, I just want to load the
> > > > > > > >   console on this node.")
> > > > > > > > - An error occurred communicating with the application
> > > > > > > >   core. Please check the logs and fix any configuration
> > > > > > > >   issues before restarting. (Again, I can't find any errors
> > > > > > > >   in nifi-app.log or nifi-user.log.)
> > > > > > > >
> > > > > > > > I can go about a half-hour reloading the page before it
> > > > > > > > comes up once, and then I can only get maybe one action in
> > > > > > > > before it auto-refreshes and shows me one of the above
> > > > > > > > error messages again.
> > > > > > > >
> > > > > > > > My first thought was that using some external ZooKeeper
> > > > > > > > servers would improve this, but that's just a hunch. Has
> > > > > > > > anyone encountered this behavior with high data volume?
> > > > > > > >
> > > > > > > > Joe
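> > > > > > > > P.S. For context, the embedded ZooKeeper on those 3 nodes
> > > > > > > > is just the stock setup (assuming defaults here), enabled
> > > > > > > > per node in nifi.properties:
> > > > > > > >
> > > > > > > >   # run an embedded ZooKeeper server alongside NiFi
> > > > > > > >   nifi.state.management.embedded.zookeeper.start=true
> > > > > > > >   # config file for the embedded ZK server
> > > > > > > >   nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties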
