Russ,

It will be very important to know which version of NiFi you're referring to on threads like this. But let me restate the core scenarios you're describing as I understand them.
Problem Statement #1

We at times have processors which appear to be stuck, in that they are no longer processing data or performing their function, and when we attempt to stop them they never appear to stop. Since active threads are still shown, NiFi also will not let us restart those processors. To overcome this we've become accustomed to simply restarting NiFi whenever it happens.

Discussion for statement #1: It is important to point out that a stuck processor almost always indicates a bug in that processor: it has gotten itself into a situation where it acquired a thread from the controller and, for one reason or another, will never relinquish it, or won't relinquish it for so long that to the user it feels effectively stuck. So we need to consider the two primary actors here: the processor developer and the user. The bottom line is that the conditions which make this possible need to be found and solved. The things we can do to help the user are simply workarounds that lessen the impact at the time, but ultimately a stuck thread is a stuck thread, and the case that allows it must be fixed.

Thinking for the developer, we can do the following:

1) Help the developer detect such cases during initial development. Improve the mock testing framework to better support tests for lifecycle cases. Most of the time these scenarios are brought about by improper handling of sockets, and tests don't often exercise those cases. But in any event we can improve our mock framework to better test lifecycle handling (a small example of driving the lifecycle through the existing mock framework is at the end of this section).

2) Help the developer gather diagnostics/data about the condition in which they are stuck. Today this is done by obtaining thread dumps. By capturing a thread dump, waiting a bit, then capturing another, it is often fairly obvious which thread is the stuck one (a minimal sketch of that two-dump approach is also at the end of this section). In cases of livelock this can be trickier, but it is still generally clear. We could explore an idea whereby, if the framework detects a component/processor thread not doing anything productive for a period of time, it automatically obtains a series of thread dumps and captures them as diagnostics/package data for that component, which would aid the developer. This is non-trivial to do but certainly a reasonable step to take.

Have you gone the thread dump route and identified the root cause of any of the stuck-thread conditions? Which processors is this happening in?

One of the processors you mentioned was DistributeLoad. By default that processor will stop distributing flow files if any of its relationships are in a back pressure condition. You can switch its strategy to 'next available', which means it will distribute data in a flowing fashion whereby data goes wherever there is no back pressure. In the latest release you also get visual indicators of back pressure, which greatly help the user understand what is happening.

Thinking for the user, we can do the following:

1) Help the user identify stuck threads/components and alert them to it. Awareness is step one, and potentially having an early alert to the condition will help correlate it to a cause, which can aid root resolution.

2) Help the user be able to kill/restart that component. What is important to point out here is that in many cases we cannot get the thread back. But what we could certainly do is quarantine/isolate that thread and give the component/processor a new one to work with. Of course, the condition/code that allows this is still present and will likely occur again, but at least it gives the user some recourse while the developers work on a solution.
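On the lifecycle point, the existing nifi-mock TestRunner can already drive the schedule/trigger/stop path, so a unit test can catch a processor that grabs a resource in an @OnScheduled method and never releases it when stopped. A rough, illustrative sketch only: SocketHoldingProcessor is a made-up toy, not anything that ships with NiFi, and it assumes JUnit 4 plus the nifi-api and nifi-mock modules are on the test classpath.

import java.io.IOException;
import java.net.ServerSocket;

import org.apache.nifi.annotation.lifecycle.OnScheduled;
import org.apache.nifi.annotation.lifecycle.OnStopped;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Assert;
import org.junit.Test;

public class LifecycleHandlingTest {

    /** Toy processor: acquires a socket when scheduled and must release it when stopped. */
    public static class SocketHoldingProcessor extends AbstractProcessor {
        private volatile ServerSocket socket;

        @OnScheduled
        public void open(final ProcessContext context) throws IOException {
            socket = new ServerSocket(0); // bind to any free port
        }

        @OnStopped
        public void close() {
            if (socket != null) {
                try {
                    socket.close(); // forgetting this is the classic leaked-resource/stuck-on-stop bug
                } catch (final IOException ignored) {
                }
            }
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            // no-op; this example is only about lifecycle handling
        }

        boolean isSocketClosed() {
            return socket == null || socket.isClosed();
        }
    }

    @Test
    public void stoppingReleasesTheSocket() {
        final SocketHoldingProcessor processor = new SocketHoldingProcessor();
        final TestRunner runner = TestRunners.newTestRunner(processor);

        // run(iterations, stopOnFinish=true) drives @OnScheduled, onTrigger, and then the
        // @OnUnscheduled/@OnStopped methods -- the shutdown path tests rarely exercise
        runner.run(1, true);

        Assert.assertTrue("socket should be closed after @OnStopped", processor.isSocketClosed());
    }
}

The point is simply that the stop/unschedule half of the lifecycle is testable today; framework improvements would be about making that easier and more obvious.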
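And for the thread dump route itself, the sketch below uses only the standard JDK ThreadMXBean; the class name, the 30-second pause, and printing to stdout are placeholder choices. In practice, running jstack <pid> against the NiFi JVM twice (or, I believe, bin/nifi.sh dump <file>) gives the same information with complete stack traces.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class TwoDumpSnapshot {

    public static void main(final String[] args) throws InterruptedException {
        dump("first dump");
        Thread.sleep(30_000L); // wait a bit: a genuinely stuck thread shows the same stack in both dumps
        dump("second dump");
    }

    private static void dump(final String label) {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("=== " + label + " ===");
        // true, true -> include locked monitors/synchronizers, which helps spot deadlock or livelock
        for (final ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // note: ThreadInfo.toString() truncates very deep stacks
        }
    }
}

A thread whose stack is identical in both dumps, and which is sitting in your processor's code rather than waiting for new work, is the usual suspect.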
+++

Problem Statement #2

When the content repository in NiFi contains a large quantity of files, it appears the only effective mechanism to get NiFi to restart is to blow away its contents first.

Discussion for statement #2: This is presumably related to long startup times caused by a large content repository needing to be evaluated. I believe this is far more efficient in recent releases. Can you advise which release you're running? How large a content repository, in terms of file count, are you referring to?

Thanks
Joe

On Mon, Jan 23, 2017 at 9:54 AM, Russell Bateman <[email protected]> wrote:
> Can we get cadmium rods?
>
> Our down-streamers are complaining, and I have some experience through
> testing myself, of NiFi getting into molasses where the only solution is to
> bounce it. Here are some comments from my users that I hope are helpful.
>
> "I'm getting really burned out over the issue of when NiFi
> processors get stuck, you can't get them to stop and the only
> solution is |# systemctl restart ps-nifi|. I actually keep a window
> open in tmux where I run this command all the time so I can just go
> to that window and press up enter to restart it again."
>
> "I have a /DistributeLoad/ processor that was sitting there doing
> nothing at all even though it said it was running. I tried
> refreshing for a while, and after several minutes I finally tried
> stopping the processor to see if stopping and starting it again
> would help.
>
> "So I told it to stop, then suddenly NiFi refreshed (even though it
> had been refusing to refresh for several minutes. Seems like it does
> whatever it wants, when it feels like it). Then it turned out that
> that processor actually HAD been running, I just couldn't see it.
> Now I want to start it again, but I can't, because it has a couple
> of stuck threads. So, I resort to |# systemctl restart ps-nifi|. I
> know the purpose of this UI is to give us visibility into the ETL
> process, but if it only gives us visibility when it feels like it,
> and then it only stops a process if it feels like it, it's really
> annoying."
>
> (Of course, some of this is "point of view" and a lack of understanding
> what's really going on.)
>
> What we do is ingest millions of medical documents including plain-text
> transcripts, HL7 pipe messages, X12 messages and CDAs (CCDs and CCDAs).
> These are analyzed for all sorts of important data, transformed into an
> intermediate format before being committed to a search engine and database
> for retrieval.
>
> We've written many dozen custom processors and use many of those that come
> with NiFi to perform this ETL over the last year or so, most very small, and
> are very happy with the visibility NiFi gives us into what used to be a
> pretty opaque and hard-to-understand ETL component. Our custom processors
> range from some very specific ones doing document analysis and involving
> regular expressions to more general ones that do HL7, XML, X12, etc.
> parsing, to invoking Tika and cTAKES. This all works very well in theory,
> but as you can see, there's considerable trouble and we're having a
> difficult time tuning, using careful back-pressure, etc.
>
> What we think we need, and we're eager for opinions here, is for NiFi to
> dedicate a thread to the UI such that bouncing NiFi is no longer the only
> option. We want to reach it and shut things down without the UI being held
> hostage to threads burdened or hung with tasks that are far from getting
> back to it. I imagine being able to right-click a process group and stop it
> like shoving cadmium rods into a radioactive pile to scram NiFi, examine
> what's going on, find and tune the parts in our flow that we had not before
> understood were problematic. (Of course, what I've just said probably
> betrays a lack of understanding on my part too.)
>
> Also, in my observation, when the quantity of files and subdirectories under
> /content_repository/ gets too big, it seems to me that the only thing I can
> do is to smoke them all before starting NiFi back up.
>
> I've been running the Java Flight Recorder attempting to spy on our NiFi
> flows remotely using Java Mission Control. This isn't easily done either
> because of how JFR works and my spyglass goes dark just as our users lose UI
> response.
>
> Thoughts?
>
> Russ
