Hi,

We recently moved the server we run NiFi on from a 24-core machine to a 4-core one, and since then NiFi stops responding roughly four times a day until it is restarted. We then switched to an 8-core server, and it now happens roughly every two days.
When this happens the UI becomes unresponsive, and so does the REST API. The NiFi active-threads metric reports 0 active threads, and the CPU is 100% idle. There is no large spike in FlowFiles, memory, or CPU usage before NiFi stops responding. However, when we checked the provenance repository we saw that events were still being created. The logs only show events being created; there are no errors or warnings.

By looking at the content of the events we were able to determine that data was flowing up until a processor that uses the RedisConnectionPoolService. We tried attaching a debugger to different processors, and all of them except 4 responded and the debugger connected successfully. The other 4 use the RedisConnectionPoolService and did not respond: 2 of them are custom processors we wrote, and the other 2 are the built-in Wait/Notify mechanism. When we tried to connect the debugger to the RedisConnectionPoolService itself, it could not connect either. The Redis server that the connection pool points at responds to us normally.

We tried to look at the active threads using /opt/nifi/bin/nifi.sh dump, but we did not see anything strange. While digging into the problem we noticed that NiFi uses an old version of spring-data-redis. We don't know whether this is related, but we opened an issue for it: https://issues.apache.org/jira/browse/NIFI-4811

The Maximum Timer Driven Thread Count is the default (10). Our custom processors are configured with a maximum of 10 concurrent tasks, and the Wait/Notify processors are configured with 5. The RedisConnectionPoolService is configured with the default values:

Max Total: 20
Max Idle: 8
Min Idle: 0
Block When Exhausted: true
Max Evictable Idle Time: 60 seconds
Time Between Eviction Runs: 30 seconds
Num Tests Per Eviction Run: -1

We made sure to always call connection.close() in our custom processors; a stripped-down sketch of the pattern we use is below. Is it possible that connections are somehow not being released or evicted, and that is why NiFi freezes like this? How can we determine whether that is the case?
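For reference, here is a stripped-down sketch of the pattern our custom processors follow when talking to Redis. It is not our real code: the class, property, and key names are made up for illustration, and it assumes the RedisConnectionPool#getConnection() API from nifi-redis-service-api together with spring-data-redis's RedisConnection.

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.redis.RedisConnectionPool;
import org.springframework.data.redis.connection.RedisConnection;

public class ExampleRedisProcessor extends AbstractProcessor {

    // Points the processor at the shared RedisConnectionPoolService instance.
    static final PropertyDescriptor REDIS_CONNECTION_POOL = new PropertyDescriptor.Builder()
            .name("redis-connection-pool")
            .displayName("Redis Connection Pool")
            .identifiesControllerService(RedisConnectionPool.class)
            .required(true)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
        return Collections.singletonList(REDIS_CONNECTION_POOL);
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        final RedisConnectionPool pool = context.getProperty(REDIS_CONNECTION_POOL)
                .asControllerService(RedisConnectionPool.class);

        // Borrow a connection from the pool. With Block When Exhausted = true, this call
        // blocks once all pooled connections are checked out and never returned.
        RedisConnection connection = null;
        try {
            connection = pool.getConnection();
            connection.set("example-key".getBytes(StandardCharsets.UTF_8),
                    "example-value".getBytes(StandardCharsets.UTF_8));
            session.transfer(flowFile, REL_SUCCESS);
        } finally {
            // Return the connection to the pool even if the Redis call throws.
            if (connection != null) {
                connection.close();
            }
        }
    }
}

We close the connection in a finally block precisely so that an exception in the Redis call cannot leak a pooled connection; if a connection did leak somewhere, I would expect getConnection() to eventually block given Block When Exhausted = true, which seems to match the symptoms we see.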
Thanks!
Daniel