Russ,

It will be very important to know which version of NiFi you're referring to on threads like this. But let me restate the core scenarios you're describing as I understand them.
Problem Statement #1

We at times have processors which appear to be stuck, in that they are no longer processing data or performing their function, and when we attempt to stop them they never appear to stop. Since active threads are still shown, NiFi also will not let us restart those processors. To overcome this we've become accustomed to simply restarting NiFi whenever it happens.

Discussion for statement #1: It is important to point out that a stuck processor almost always indicates a bug in that processor: it has gotten itself into a situation where it acquired a thread from the controller and, for one reason or another, will never relinquish it, or won't relinquish it for so long that to the user it feels effectively stuck. So we need to consider the two primary actors here: the processor developer and the user. The bottom line is that the conditions which make this possible need to be found and solved. The things we can do to help the user are simply workarounds that lessen the impact at the time, but ultimately a stuck thread is a stuck thread, and the case that allows it must be fixed.

Thinking for the developer, we can do the following:

1) Help the developer detect such cases during initial development. Improve the mock testing framework to better support tests for lifecycle cases. Most of the time these scenarios are brought about by improper handling of sockets, and tests don't often exercise those cases. But in any event we can improve our mock framework to better test lifecycle handling (a small example of driving the lifecycle through the existing mock framework is at the end of this section).

2) Help the developer gather diagnostics/data about the condition in which they are stuck. Today this is done by obtaining thread dumps. By capturing a thread dump, waiting a bit, then capturing another, it is often fairly obvious which thread is the stuck one (a minimal sketch of that two-dump approach is also at the end of this section). In cases of livelock this can be trickier, but it is still generally clear. We could explore an idea whereby, if the framework detects a component/processor thread not doing anything productive for a period of time, it automatically obtains a series of thread dumps and captures them as diagnostics/package data for that component, which would aid the developer. This is non-trivial to do but certainly a reasonable step to take.

Have you gone the thread dump route and identified the root cause of any of the stuck-thread conditions? Which processors is this happening in?

One of the processors you mentioned was DistributeLoad. By default that processor will stop distributing flow files if any of its relationships are in a back pressure condition. You can switch its strategy to 'next available', which means it will distribute data in a flowing fashion whereby data goes wherever there is no back pressure. In the latest release you also get visual indicators of back pressure, which greatly help the user understand what is happening.

Thinking for the user, we can do the following:

1) Help the user identify stuck threads/components and alert them to it. Awareness is step one, and potentially having an early alert to the condition will help correlate it to a cause, which can aid root resolution.

2) Help the user be able to kill/restart that component. What is important to point out here is that in many cases we cannot get the thread back. But what we could certainly do is quarantine/isolate that thread and give the component/processor a new one to work with. Of course, the condition/code that allows this is still present and will likely occur again, but at least it gives the user some recourse while the developers work on a solution.
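On the lifecycle point, the existing nifi-mock TestRunner can already drive the schedule/trigger/stop path, so a unit test can catch a processor that grabs a resource in an @OnScheduled method and never releases it when stopped. A rough, illustrative sketch only: SocketHoldingProcessor is a made-up toy, not anything that ships with NiFi, and it assumes JUnit 4 plus the nifi-api and nifi-mock modules are on the test classpath.

import java.io.IOException;
import java.net.ServerSocket;

import org.apache.nifi.annotation.lifecycle.OnScheduled;
import org.apache.nifi.annotation.lifecycle.OnStopped;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Assert;
import org.junit.Test;

public class LifecycleHandlingTest {

    /** Toy processor: acquires a socket when scheduled and must release it when stopped. */
    public static class SocketHoldingProcessor extends AbstractProcessor {
        private volatile ServerSocket socket;

        @OnScheduled
        public void open(final ProcessContext context) throws IOException {
            socket = new ServerSocket(0); // bind to any free port
        }

        @OnStopped
        public void close() {
            if (socket != null) {
                try {
                    socket.close(); // forgetting this is the classic leaked-resource/stuck-on-stop bug
                } catch (final IOException ignored) {
                }
            }
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            // no-op; this example is only about lifecycle handling
        }

        boolean isSocketClosed() {
            return socket == null || socket.isClosed();
        }
    }

    @Test
    public void stoppingReleasesTheSocket() {
        final SocketHoldingProcessor processor = new SocketHoldingProcessor();
        final TestRunner runner = TestRunners.newTestRunner(processor);

        // run(iterations, stopOnFinish=true) drives @OnScheduled, onTrigger, and then the
        // @OnUnscheduled/@OnStopped methods -- the shutdown path tests rarely exercise
        runner.run(1, true);

        Assert.assertTrue("socket should be closed after @OnStopped", processor.isSocketClosed());
    }
}

The point is simply that the stop/unschedule half of the lifecycle is testable today; framework improvements would be about making that easier and more obvious.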
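And for the thread dump route itself, the sketch below uses only the standard JDK ThreadMXBean; the class name, the 30-second pause, and printing to stdout are placeholder choices. In practice, running jstack <pid> against the NiFi JVM twice (or, I believe, bin/nifi.sh dump <file>) gives the same information with complete stack traces.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class TwoDumpSnapshot {

    public static void main(final String[] args) throws InterruptedException {
        dump("first dump");
        Thread.sleep(30_000L); // wait a bit: a genuinely stuck thread shows the same stack in both dumps
        dump("second dump");
    }

    private static void dump(final String label) {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("=== " + label + " ===");
        // true, true -> include locked monitors/synchronizers, which helps spot deadlock or livelock
        for (final ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // note: ThreadInfo.toString() truncates very deep stacks
        }
    }
}

A thread whose stack is identical in both dumps, and which is sitting in your processor's code rather than waiting for new work, is the usual suspect.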
+++

Problem Statement #2

When the content repository in NiFi contains a large quantity of files, it appears the only effective mechanism to get NiFi to restart is to blow away its contents first.

Discussion for statement #2: This is presumably related to long startup times caused by a large content repository needing to be evaluated. I believe this is far more efficient in recent releases. Can you advise which release you're running? How large a content repository, in terms of file count, are you referring to?

Thanks
Joe

On Mon, Jan 23, 2017 at 9:54 AM, Russell Bateman <[email protected]> wrote:
> Can we get cadmium rods?
>
> Our down-streamers are complaining, and I have some experience through
> testing myself, of NiFi getting into molasses where the only solution is to
> bounce it. Here are some comments from my users that I hope are helpful.
>
> "I'm getting really burned out over the issue of when NiFi
> processors get stuck, you can't get them to stop and the only
> solution is |# systemctl restart ps-nifi|. I actually keep a window
> open in tmux where I run this command all the time so I can just go
> to that window and press up enter to restart it again."
>
> "I have a /DistributeLoad/ processor that was sitting there doing
> nothing at all even though it said it was running. I tried
> refreshing for a while, and after several minutes I finally tried
> stopping the processor to see if stopping and starting it again
> would help.
>
> "So I told it to stop, then suddenly NiFi refreshed (even though it
> had been refusing to refresh for several minutes. Seems like it does
> whatever it wants, when it feels like it). Then it turned out that
> that processor actually HAD been running, I just couldn't see it.
> Now I want to start it again, but I can't, because it has a couple
> of stuck threads. So, I resort to |# systemctl restart ps-nifi|. I
> know the purpose of this UI is to give us visibility into the ETL
> process, but if it only gives us visibility when it feels like it,
> and then it only stops a process if it feels like it, it's really
> annoying."
>
> (Of course, some of this is "point of view" and a lack of understanding
> what's really going on.)
>
> What we do is ingest millions of medical documents including plain-text
> transcripts, HL7 pipe messages, X12 messages and CDAs (CCDs and CCDAs).
> These are analyzed for all sorts of important data, transformed into an
> intermediate format before being committed to a search engine and database
> for retrieval.
>
> We've written many dozen custom processors and use many of those that come
> with NiFi to perform this ETL over the last year or so, most very small, and
> are very happy with the visibility NiFi gives us into what used to be a
> pretty opaque and hard-to-understand ETL component. Our custom processors
> range from some very specific ones doing document analysis and involving
> regular expressions to more general ones that do HL7, XML, X12, etc.
> parsing, to invoking Tika and cTAKES. This all works very well in theory,
> but as you can see, there's considerable trouble and we're having a
> difficult time tuning, using careful back-pressure, etc.
>
> What we think we need, and we're eager for opinions here, is for NiFi to
> dedicate a thread to the UI such that bouncing NiFi is no longer the only
> option. We want to reach it and shut things down without the UI being held
> hostage to threads burdened or hung with tasks that are far from getting
> back to it. I imagine being able to right-click a process group and stop it
> like shoving cadmium rods into a radioactive pile to scram NiFi, examine
> what's going on, find and tune the parts in our flow that we had not before
> understood were problematic. (Of course, what I've just said probably
> betrays a lack of understanding on my part too.)
>
> Also, in my observation, when the quantity of files and subdirectories under
> /content_repository/ gets too big, it seems to me that the only thing I can
> do is to smoke them all before starting NiFi back up.
>
> I've been running the Java Flight Recorder attempting to spy on our NiFi
> flows remotely using Java Mission Control. This isn't easily done either
> because of how JFR works and my spyglass goes dark just as our users lose UI
> response.
>
> Thoughts?
>
> Russ
