Thanks for the information! I did collect some of those diagnostics files, but nothing really jumped out at me - the system in question is not under great load or anything exciting. Is there anything specific I should be looking for?
I am running zookeeper as an external process (version 3.5.5, I think)
across three nodes (one more than nifi, in order to avoid deadlock, as per
the admin guide). I might update this if you think it could be a source of
issues.

On Thu, 3 Jun 2021 at 23:09, Mark Bean <[email protected]> wrote:

> In addition to Mark Payne's suggestion - which you can run and evaluate
> locally, just not upload results - you can also look at several aspects
> of ZooKeeper. Firstly, are you running embedded or external ZooKeeper?
> We have found that external is far more reliable, especially for "busy"
> clusters. Likely, this is due to the external ZooKeeper having its own
> JVM and associated resources.
>
> Also, take a look at the following in nifi.properties. It's a bit more
> art than science; there is no one right answer. The values depend on
> your environment. But you can begin to experiment with a reasonable
> number of threads, and relax some of the timeouts.
>
> nifi.cluster.protocol.heartbeat.interval=
> nifi.cluster.node.protocol.threads=
> nifi.cluster.node.protocol.max.threads=
> nifi.cluster.node.connection.timeout=
> nifi.cluster.node.read.timeout=
> nifi.cluster.node.max.concurrent.requests=
>
> -Mark
>
> On Wed, Jun 2, 2021 at 8:30 PM Phil H <[email protected]> wrote:
>
> > Thanks for getting back to me, Mark.
> >
> > Unfortunately it is running on an intranet, so I can’t get logs and
> > so forth off the system. Is there anything in particular I can look
> > out for?
> >
> > I am running zookeeper as a separate service (not embedded) on three
> > nodes. The nifi cluster currently has two nodes.
> >
> > Cheers,
> > Phil
> >
> > On Thu, 3 Jun 2021 at 10:17, Mark Payne <[email protected]> wrote:
> >
> > > Hey Phil,
> > >
> > > Can you grab a diagnostics dump from one of the nodes (preferably
> > > the cluster coordinator)? Ideally grab 3 of them, with about 5 mins
> > > in between.
> > > To do that, run:
> > >
> > >     bin/nifi.sh diagnostics <filename>
> > >
> > > So run something like:
> > >
> > >     bin/nifi.sh diagnostics diagnostics1.txt
> > >     <wait 3-5 mins>
> > >     bin/nifi.sh diagnostics diagnostics2.txt
> > >     <wait 3-5 mins>
> > >     bin/nifi.sh diagnostics diagnostics3.txt
> > >
> > > And then upload those diagnostics text files? They should not
> > > contain any sensitive information, aside from maybe file paths
> > > (which most don’t consider sensitive, but you may). But I recommend
> > > you glance through them to make sure that you leave out any
> > > sensitive information.
> > >
> > > Those dumps should help in understanding the problem, or at least
> > > zeroing in on it.
> > >
> > > Also, is NiFi using its own dedicated zookeeper, or is it shared
> > > with other services? How many nodes is the zookeeper?
> > >
> > > > On Jun 2, 2021, at 7:54 PM, Phil H <[email protected]> wrote:
> > > >
> > > > Hi there,
> > > >
> > > > I am getting a lot of these both in the web interface to my
> > > > servers, and in the cluster communication between the nodes. All
> > > > other aspects of the servers are fine. TCP connections to NiFi,
> > > > as well as SSH connections to the servers, are stable (running
> > > > for days at a time). I’m lucky if I go 5 minutes without the web
> > > > UI dropping out or a cluster re-election due to a heartbeat being
> > > > missed.
> > > >
> > > > Running 1.13.2, recently upgraded from 1.9.2. I was getting the
> > > > same issues with the old version, but they seem to be much more
> > > > frequent now.
> > > >
> > > > Help!
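For what it's worth, Mark Payne's three-dump procedure quoted above is easy
to script. This is a minimal sketch, not from the thread itself: the
/opt/nifi default for NIFI_HOME and the 300-second delay are my
assumptions, so adjust both for your install.

```shell
#!/bin/sh
# Sketch: collect three NiFi diagnostics dumps roughly five minutes
# apart, per Mark Payne's suggestion. Run on a cluster node, ideally
# the cluster coordinator.
# ASSUMPTIONS: NiFi lives under /opt/nifi; ~5 min between dumps.

collect_diagnostics() {
    nifi_home="${1:-/opt/nifi}"   # assumed install path
    delay="${2:-300}"             # seconds between dumps (~5 mins)
    for i in 1 2 3; do
        # Each invocation writes one dump file, e.g. diagnostics1.txt
        "$nifi_home/bin/nifi.sh" diagnostics "diagnostics${i}.txt"
        if [ "$i" -lt 3 ]; then
            sleep "$delay"
        fi
    done
}

# Example (uncomment to run):
# collect_diagnostics /opt/nifi 300
```

Remember to skim the resulting files for anything you consider sensitive
(file paths, hostnames) before sharing them.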

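As a concrete reference point for the nifi.properties entries Mark Bean
lists, here is one illustrative shape they might take. These values are
assumptions on my part (loosely based on common defaults), not
recommendations from this thread; the advice above is to experiment, and
in particular to relax the timeouts for an unstable cluster.

```properties
# Illustrative starting points only - tune for your environment.
nifi.cluster.protocol.heartbeat.interval=15 sec
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.connection.timeout=30 sec
nifi.cluster.node.read.timeout=30 sec
nifi.cluster.node.max.concurrent.requests=100
```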