In addition to Mark Payne's suggestion - which you can run and evaluate
locally, just not upload the results - you can also look at several aspects
of ZooKeeper. First, are you running an embedded or an external ZooKeeper?
We have found that external is far more reliable, especially for "busy"
clusters, likely because the external ZooKeeper has its own JVM and
associated resources.
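
If you do move to an external ensemble, it is wired up in nifi.properties.
The lines below are just a sketch - the host names and port are
placeholders for your own ensemble, not anything specific to your setup:

nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181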

Also, take a look at the following in nifi.properties. It's a bit more art
than science; there is no one right answer, and the values depend on your
environment. But you can begin by experimenting with a reasonable number of
threads and relaxing some of the timeouts (a sketch of possible starting
values follows the list below).

nifi.cluster.protocol.heartbeat.interval=
nifi.cluster.node.protocol.threads=
nifi.cluster.node.protocol.max.threads=
nifi.cluster.node.connection.timeout=
nifi.cluster.node.read.timeout=
nifi.cluster.node.max.concurrent.requests=
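
To make that concrete, here is an illustrative starting point only - these
are not recommendations, and the defaults in your version may differ. The
idea is to tolerate brief GC pauses or load spikes without missing
heartbeats, at the cost of detecting genuinely dead nodes more slowly:

nifi.cluster.protocol.heartbeat.interval=15 sec
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec
nifi.cluster.node.max.concurrent.requests=100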

-Mark

On Wed, Jun 2, 2021 at 8:30 PM Phil H <[email protected]> wrote:

> Thanks for getting back to me Mark.
>
> Unfortunately it is running on an intranet so I can’t get logs and so forth
> off the system. Is there anything in particular I can look out for?
>
> I am running zookeeper as a separate service (not embedded) on three
> nodes.  The nifi cluster currently has two nodes.
>
> Cheers,
> Phil
>
> On Thu, 3 Jun 2021 at 10:17, Mark Payne <[email protected]> wrote:
>
> > Hey Phil,
> >
> > Can you grab a diagnostics dump from one of the nodes (preferably the
> > cluster coordinator)? Ideally grab 3 of them, with about 5 mins in
> > between.
> >
> > To do that, run:
> >
> > bin/nifi.sh diagnostics <filename>
> >
> > So run something like:
> >
> > bin/nifi.sh diagnostics diagnostics1.txt
> > <wait 3-5 mins>
> > bin/nifi.sh diagnostics diagnostics2.txt
> > <wait 3-5 mins>
> > bin/nifi.sh diagnostics diagnostics3.txt
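> >
> > If it helps, a small shell loop (an untested sketch; adjust the sleep to
> > taste) can automate the waits:
> >
> > for i in 1 2 3; do
> >   bin/nifi.sh diagnostics diagnostics${i}.txt
> >   [ "$i" -lt 3 ] && sleep 300   # wait ~5 mins between dumps
> > done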
> >
> > And then upload those diagnostics text files?
> > They should not contain any sensitive information, aside from maybe file
> > paths (which most don’t consider sensitive, but you may). But I recommend
> > you glance through them to make sure that you leave out any sensitive
> > information.
> >
> > Those dumps should help in understanding the problem, or at least zeroing
> > in on it.
> >
> > Also, is NiFi using its own dedicated zookeeper or is it shared with
> > other services? How many nodes are in the zookeeper ensemble?
> >
> >
> >
> > > On Jun 2, 2021, at 7:54 PM, Phil H <[email protected]> wrote:
> > >
> > > Hi there,
> > >
> > > I am getting a lot of these both in the web interface to my servers,
> > > and in the cluster communication between the nodes. All other aspects
> > > of the servers are fine. TCP connections to NiFi, as well as SSH
> > > connections to the servers, are stable (running for days at a time).
> > > I’m lucky if I go 5 minutes without the web UI dropping out or a
> > > cluster re-election due to a heartbeat being missed.
> > >
> > > Running 1.13.2, recently upgraded from 1.9.2. I was getting the same
> > > issues with the old version, but they seem to be much more frequent now.
> > >
> > > Help!
> >
> >
>