Hi,

Recently we switched the server we run NiFi on from a 24-core machine to a
4-core one, and since then NiFi stops responding approximately 4 times a
day until it is restarted. We then switched to an 8-core server, and now
it happens approximately every 2 days.

When this happens, the UI becomes unresponsive, as does the REST API.
The NiFi active-threads metric reports 0 active threads, and the CPU is
at 100% idle. There is no large spike in FlowFiles, memory, or CPU usage
before NiFi stops responding. However, when we checked the provenance
repository we saw that events were still being created. The logs only show
events being created; there are no errors or warnings. By looking into
the content of the events we were able to determine that FlowFiles were
flowing up until a processor using the RedisConnectionPoolService.

We tried to attach a debugger to different processors, and all of them
except 4 responded and the debugger connected successfully.
The other 4 use the RedisConnectionPoolService, and they didn't respond.
2 of these processors are custom ones we wrote; the other 2 are the
built-in Wait/Notify processors. When we tried to connect to the
RedisConnectionPoolService itself, the debugger wasn't able to connect to
it either. The Redis server that the connection pool talks to responds
to us normally.
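Since Redis itself answers normally, one cheap check from the Redis side is
whether the number of connections held by the NiFi host creeps up to the
pool's Max Total (20) and then stays there with high idle times; that would
suggest connections are borrowed but never returned. A sketch of the idea
(the CLIENT LIST excerpt and the 10.0.0.5 NiFi address below are made up
for illustration; on the real server run `redis-cli -h <redis-host> CLIENT
LIST`):

```shell
# Fabricated excerpt of CLIENT LIST output; on the real Redis server run:
#   redis-cli -h <redis-host> CLIENT LIST
cat > /tmp/client-list.txt <<'EOF'
id=3 addr=10.0.0.5:41234 fd=8 name= age=3600 idle=3500 flags=N db=0
id=4 addr=10.0.0.5:41240 fd=9 name= age=3590 idle=3480 flags=N db=0
id=5 addr=10.0.0.9:55012 fd=10 name= age=12 idle=0 flags=N db=0
EOF

# Count connections coming from the NiFi host (10.0.0.5 is an assumption).
# If this count sits at Max Total (20) with large idle= values while NiFi
# is hung, connections are likely not being returned to the pool.
grep -c 'addr=10\.0\.0\.5' /tmp/client-list.txt
```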

We tried to look at the active threads using /opt/nifi/bin/nifi.sh dump,
but we did not see anything strange.
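For what it's worth, a hang caused by an exhausted connection pool has a
fairly recognizable signature in a thread dump: threads parked inside
commons-pool's GenericObjectPool.borrowObject (the Jedis client used under
spring-data-redis sits on commons-pool). A sketch of what to grep for (the
stack excerpt below is fabricated for illustration):

```shell
# Take a fresh dump while NiFi is hung; passing a file name writes the
# dump there instead of the bootstrap log:
#   /opt/nifi/bin/nifi.sh dump /tmp/nifi-dump.txt
# Fabricated excerpt of what an exhausted-pool hang looks like:
cat > /tmp/nifi-dump.txt <<'EOF'
"Timer-Driven Process Thread-3" #87 prio=5 WAITING
   at sun.misc.Unsafe.park(Native Method)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:583)
   at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:442)
EOF

# Any hit means a thread is blocked waiting for a pooled connection that
# is never coming back:
grep -c 'GenericObjectPool.borrowObject' /tmp/nifi-dump.txt
```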

While digging into the problem we noticed that NiFi uses an old version
of spring-data-redis. We don't know if this is the cause, but we opened
an issue for it: https://issues.apache.org/jira/browse/NIFI-4811

The Maximum Timer Driven Thread Count is the default (10). Our custom
processors are configured with a maximum of 10 concurrent tasks, and the
Wait/Notify processors with 5. The RedisConnectionPoolService is
configured with the default values:
Max Total: 20
Max Idle: 8
Min Idle: 0
Block When Exhausted: true
Max Evictable Idle Time: 60 seconds
Time Between Eviction Runs: 30 seconds
Num Tests Per Eviction Run: -1

We made sure to always call connection.close() in our custom processors.
Is it possible that connections are somehow not released or evicted, and
that is why NiFi freezes like this? How can we determine whether this is
the case?
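In principle, yes: with Block When Exhausted = true and an unlimited max
wait (the commons-pool default), a connection that is borrowed but never
closed blocks the next borrower forever, which could explain threads
silently parking while the CPU sits idle. A minimal sketch of the failure
mode, using a Semaphore as a stand-in for the pool (the RedisConnection
class here is hypothetical, not NiFi's actual API):

```java
import java.util.concurrent.Semaphore;

public class PoolLeakDemo {
    // A pool with Max Total = 20 behaves like a semaphore with 20 permits.
    static final Semaphore pool = new Semaphore(20);

    // Hypothetical stand-in for a pooled Redis connection.
    static class RedisConnection implements AutoCloseable {
        RedisConnection() throws InterruptedException { pool.acquire(); }
        @Override public void close() { pool.release(); }
    }

    // Safe: try-with-resources returns the permit even if the work throws.
    static void safeUse() throws Exception {
        try (RedisConnection conn = new RedisConnection()) {
            // ... issue Redis commands ...
        }
    }

    // Leaky: if the work throws before close(), the permit is never
    // returned; with Block When Exhausted = true and no wait limit, the
    // 21st concurrent borrow would park forever.
    static void leakyUse() throws Exception {
        RedisConnection conn = new RedisConnection();
        // doWork(conn);  // an exception here would skip the close() below
        conn.close();
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) safeUse();
        System.out.println("permits left: " + pool.availablePermits());
        // prints: permits left: 20
    }
}
```

If close() is called in a finally block (or via try-with-resources) on
every path, including exceptional ones, this particular leak cannot
happen; a leak on just one rarely-taken error path is enough to drain the
pool over a few days, which would fit the "hangs every ~2 days" pattern.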

Thanks!
Daniel
