As promised, and as embarrassing as it seems now, I'm reporting what happened...

It appears that one of our IT guys failed to type /G/ when he created the swap partition on this staging server, so it ended up sized at 128M instead of 128G. (Fortunately, it's not a production server, and I think we have safeties in place to guard against similarly screwing up those installations.)


This turned up visibly in htop. Unfortunately, though htop was an early tool in our quest for what wasn't right, we were thinking it was something in NiFi, one of our processors in the flow, etc., and concentrating on that angle; we weren't looking at the top section of htop's output until after poring through NiFi logs and eliminating all other suspicions.
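
For what it's worth, the plain CentOS 7 tools would have shown the undersized swap immediately; a quick sketch (nothing NiFi-specific here):

     $ free -h          # the Swap: line shows total/used/free swap
     $ cat /proc/swaps  # lists each swap device with its size in KB

Either one would have reported the 128M of swap where 128G was expected.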

Live and learn.

Russ

On 10/05/2016 06:21 PM, Andrew Grande wrote:
Just a sanity check: has the number of open file handles been increased
as per the quickstart document? You might need much more for your flow.
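
(For reference, the configuration best practices in the admin guide have
you raise those limits in /etc/security/limits.conf; the numbers below
are the commonly quoted ones, so double-check against the current docs:

     *  hard  nofile  50000
     *  soft  nofile  50000

then log out and back in, or reboot, for the new limits to take effect.)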

Another tip: when your server experiences undesired hiccups like that,
try running 'nifi.sh dump save-in-this-file.txt' and investigate/share
where NiFi threads are being held back.

Andrew

On Tue, Oct 4, 2016, 10:54 AM Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

We use the templating to create FHIR XML, in this case, a

     <Binary>
        ...
        <content value="$flowfile_contents" />
     </Binary>

construct that includes a base-64 encoding of a PDF, the flowfile
contents coming into the templating processor. These can get to be
megabytes in size, though our sample data was just under 1 MB.
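
The processor's code isn't shown in this thread, but the merge it
performs presumably boils down to something like the standard Velocity
evaluate call sketched below; base64Pdf and templateText are placeholder
names, not what the processor actually uses:

     import org.apache.velocity.VelocityContext;
     import org.apache.velocity.app.VelocityEngine;
     import java.io.StringWriter;

     // Placeholder inputs: in the real processor these would come from
     // the incoming flowfile and the configured template.
     String base64Pdf    = "...base-64 text, potentially megabytes...";
     String templateText =
         "<Binary>\n  <content value=\"$flowfile_contents\" />\n</Binary>";

     // Merge the flowfile contents into the FHIR template in place of
     // the $flowfile_contents VTL reference.
     VelocityEngine engine = new VelocityEngine();
     engine.init();
     VelocityContext context = new VelocityContext();
     context.put("flowfile_contents", base64Pdf);
     StringWriter out = new StringWriter();
     engine.evaluate(context, out, "VelocityTemplating", templateText);
     String merged = out.toString();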

Yesterday, I built a new, reduced flow restricting the use of my
/VelocityTemplating/ processor to perform only the part of that task
that I suspected would be taking so much time, that is, copying the
base-64 data into the template in place of the VTL macro. However, I
could not reproduce the problem, even though I did this on the very production
server (actually, more of a staging server, but it was the very server
where the trouble was detected in the first place).

Predictably (that is, if, like me, you believe Murphy reigns supreme in
this universe), the action using the very files in question took
virtually no time at all, just as had been my experience running on my
local development host. I then slightly expanded the new flow to take in
some of the other trappings of the original one (but, it was the
templating that was reported as being the bottleneck--minutes to fill
out the template instead of milliseconds). In short, I could not
replicate the problem. True, the moon is in a different phase than late
last week when this was reported.

For the benefit of the community, I will come back here and report if
and when we stumble upon this again, it recurs, and/or we reach a
decision about anything. At present, we're looking to force
re-ingestion of the run, using the original flow design and including
the documents that reportedly experienced this trouble, to see if it
happens yet again.

In the meantime, I can say:

     - I keep no state in this processor (indeed, I try not to and don't
     think I have anything stateful in any of our custom processors).
     - The server has some 40 cores, 128 GB of RAM, and 12 TB of disk:
     dedicated hardware, CentOS 7, recently built and installed.
     - Reportedly, I learned, little else was going on on the server at
     the same time, either in NiFi or elsewhere.
     - The NiFi heap is configured to be 12 GB.
     - Not so far along yet as to understand thread usage or garbage
     collection state.

Again, thanks for the suggestions from both of you.

Russ


On 10/03/2016 06:28 PM, Joe Witt wrote:
Russ,

As Jeff points out, a lack of available threads could be a factor in
slower processing times, but that would manifest itself in your seeing
that the processor isn't running very often.  If instead the processor
itself, when executing, takes much longer than on the other box, then
it is probably best to look at some other culprits.  To check this out,
you can view the status history and look at the average number of tasks
and average task time for this processor.  Does it look right to you in
terms of how often it runs, how long it takes, and whether the amount
of time it takes is growing?

If you find that performance of this processor itself is slowing then
consider a few things.
1) Does it maintain some internal state and, if so, is the data
structure it is using efficient for lookups?
2) How does your heap look?  Is there a lot of garbage collection
activity?  Are there any full garbage collections and, if so, how often?
It should generally be the case in a well-configured and well-designed
system that full garbage collections never occur (ever). (A GC-logging
sketch follows this list.)
3) Attaching a remote debugger and/or running profilers on it can be
really illuminating.
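
Regarding point 2 above: one low-effort way to get GC visibility is to
add the Java 8 GC-logging flags to conf/bootstrap.conf and restart NiFi
(the java.arg numbers below are arbitrary; pick indexes not already used
in your file, and adjust the log path to taste):

     # conf/bootstrap.conf -- GC logging (Java 8 flags)
     java.arg.20=-XX:+PrintGCDetails
     java.arg.21=-XX:+PrintGCDateStamps
     java.arg.22=-Xloggc:logs/gc.log

Frequent full collections in that log would point at heap pressure
rather than at the processor itself.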

Joe

On Mon, Oct 3, 2016 at 11:26 AM, Jeff <jtsw...@gmail.com> wrote:
Russell,

This sounds like it's an environmental issue.  Are you able to see the
heap usage on the production machine?  Are there enough available
threads to get the throughput you are observing when you run locally?
Have you double-checked the scheduling tab on the processor config to
make sure it is running as aggressively as it runs locally?

I have run into this sort of thing before, and it was because of
flowfile congestion in other areas of the flow, and there were no
threads available for other processors to get through their own queues.

Just trying to think through some of the obvious/high level things that
might be affecting your flow...

- Jeff

On Mon, Oct 3, 2016 at 9:43 AM Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

We use NiFi for an ETL feed. On one of the lines, we use a custom
processor, *VelocityTemplating* (calls Apache Velocity), which works
very well and indeed is imperceptibly fast when run locally on the same
data (template, VTL macros, substitution fodder). However, in production
it's another matter. What takes no time at all in local runs takes
minutes in that environment.

I'm looking for suggestions as to a) why this might be and b) how best
to go about examining/debugging it. I think I will soon have remote
access to the production machine (a VPN must be set up).

Thanks,

Russ


