As promised, and as embarrassing as it seems now, I'm reporting what happened...

It appears that one of our IT guys failed to type /G/ when he created the swap partition on this staging server, so it ended up sized at 128M instead of 128G. (Fortunately, it's not a production server, and I think we have safeties in place to guard against similarly screwing up those installations.)


This turned up visibly in htop. Unfortunately, though htop was an early tool in our quest for what wasn't right, we were thinking it was something in NiFi, one of our processors in the flow, etc., and concentrating on that angle; we weren't looking at the top section of htop's output until after poring through NiFi logs and eliminating all other suspicions.
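
For what it's worth, the plain CentOS 7 tools would have shown the undersized swap immediately; a quick sketch (nothing NiFi-specific here):

     $ free -h          # the Swap: line shows total/used/free swap
     $ cat /proc/swaps  # lists each swap device with its size in KB

Either one would have reported the 128M of swap where 128G was expected.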

Live and learn.

Russ

On 10/05/2016 06:21 PM, Andrew Grande wrote:
Just a sanity check: has the number of open file handles been increased
as per the quickstart document? You might need much more for your flow.
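
(For reference, the configuration best practices in the admin guide have
you raise those limits in /etc/security/limits.conf; the numbers below
are the commonly quoted ones, so double-check against the current docs:

     *  hard  nofile  50000
     *  soft  nofile  50000

then log out and back in, or reboot, for the new limits to take effect.)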

Another tip: when your server experiences undesired hiccups like that,
try running 'nifi.sh dump save-in-this-file.txt' and investigate/share
where NiFi threads are being held back.

Andrew

On Tue, Oct 4, 2016, 10:54 AM Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

We use the templating to create FHIR XML, in this case, a

     <Binary>
        ...
        <content value="$flowfile_contents" />
     </Binary>

construct that includes a base-64 encoding of a PDF, the flowfile
contents coming into the templating processor. These can get to be
megabytes in size, though our sample data was just under 1 MB.
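
The processor's code isn't shown in this thread, but the merge it
performs presumably boils down to something like the standard Velocity
evaluate call sketched below; base64Pdf and templateText are placeholder
names, not what the processor actually uses:

     import org.apache.velocity.VelocityContext;
     import org.apache.velocity.app.VelocityEngine;
     import java.io.StringWriter;

     // Placeholder inputs: in the real processor these would come from
     // the incoming flowfile and the configured template.
     String base64Pdf    = "...base-64 text, potentially megabytes...";
     String templateText =
         "<Binary>\n  <content value=\"$flowfile_contents\" />\n</Binary>";

     // Merge the flowfile contents into the FHIR template in place of
     // the $flowfile_contents VTL reference.
     VelocityEngine engine = new VelocityEngine();
     engine.init();
     VelocityContext context = new VelocityContext();
     context.put("flowfile_contents", base64Pdf);
     StringWriter out = new StringWriter();
     engine.evaluate(context, out, "VelocityTemplating", templateText);
     String merged = out.toString();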

Yesterday, I built a new, reduced flow restricting the use of my
/VelocityTemplating/ processor to perform only the part of that task
that I suspected would be taking so much time, that is, copying the
base-64 data into the template in place of the VTL macro. However, I
could not reproduce the problem, even though I did this on the very production
server (actually, more of a staging server, but it was the very server
where the trouble was detected in the first place).

Predictably (that is, if, like me, you believe Murphy reigns supreme in
this universe), the action using the very files in question took
virtually no time at all, just as had been my experience running on my
local development host. I then slightly expanded the new flow to take in
some of the other trappings of the original one (but, it was the
templating that was reported as being the bottleneck--minutes to fill
out the template instead of milliseconds). In short, I could not
replicate the problem. True, the moon is in a different phase than late
last week when this was reported.

For the benefit of the community, I will come back here and report if
and when we stumble upon this again, it recurs, and/or we reach a
decision about anything. At present, we're looking to force
re-ingestion of the run, using the original flow design and including
the documents that reportedly experienced this trouble, to see if it
happens yet again.

In the meantime, I can say:

     - I keep no state in this processor (indeed, I try not to and don't
     think I have anything stateful in any of our custom processors).
     - The server has some 40 cores, 128 GB of RAM, and 12 TB of disk:
     dedicated hardware, CentOS 7, recently built and installed.
     - Reportedly, I learned, little else was going on on the server at
     the same time, either in NiFi or elsewhere.
     - The NiFi heap is configured to be 12 GB.
     - Not so far along yet as to understand thread usage or garbage
     collection state.

Again, thanks for the suggestions from both of you.

Russ


On 10/03/2016 06:28 PM, Joe Witt wrote:
Russ,

As Jeff points out, a lack of available threads could be a factor in
slower processing times, but that would manifest itself in your seeing
that the processor isn't running very often.  If instead the processor
itself, when executing, takes much longer than on the other box, then
it is probably best to look at some other culprits.  To check this out,
you can view the status history and look at the average number of tasks
and average task time for this processor.  Does it look right to you in
terms of how often it runs, how long it takes, and whether the amount
of time it takes is growing?

If you find that performance of this processor itself is slowing then
consider a few things.
1) Does it maintain some internal state and, if so, is the data
structure it is using efficient for lookups?
2) How does your heap look?  Is there a lot of garbage collection
activity?  Are there any full garbage collections and, if so, how often?
It should generally be the case in a well-configured and well-designed
system that full garbage collections never occur (ever). (A GC-logging
sketch follows this list.)
3) Attaching a remote debugger and/or running profilers on it can be
really illuminating.
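
Regarding point 2 above: one low-effort way to get GC visibility is to
add the Java 8 GC-logging flags to conf/bootstrap.conf and restart NiFi
(the java.arg numbers below are arbitrary; pick indexes not already used
in your file, and adjust the log path to taste):

     # conf/bootstrap.conf -- GC logging (Java 8 flags)
     java.arg.20=-XX:+PrintGCDetails
     java.arg.21=-XX:+PrintGCDateStamps
     java.arg.22=-Xloggc:logs/gc.log

Frequent full collections in that log would point at heap pressure
rather than at the processor itself.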

Joe

On Mon, Oct 3, 2016 at 11:26 AM, Jeff <jtsw...@gmail.com> wrote:
Russell,

This sounds like it's an environmental issue.  Are you able to see the
heap usage on the production machine?  Are there enough available
threads to get the throughput you are observing when you run locally?
Have you double-checked the scheduling tab on the processor config to
make sure it is running as aggressively as it runs locally?

I have run into this sort of thing before, and it was because of
flowfile congestion in other areas of the flow, and there were no
threads available for other processors to get through their own queues.

Just trying to think through some of the obvious/high level things that
might be affecting your flow...

- Jeff

On Mon, Oct 3, 2016 at 9:43 AM Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

We use NiFi for an ETL feed. On one of the lines, we use a custom
processor, *VelocityTemplating* (calls Apache Velocity), which works
very well and indeed is imperceptibly fast when run locally on the same
data (template, VTL macros, substitution fodder). However, in production
it's another matter. What takes no time at all in local runs takes
minutes in that environment.

I'm looking for suggestions as to a) why this might be and b) how best
to go about examining/debugging it. I think I will soon have remote
access to the production machine (a VPN must be set up).

Thanks,

Russ


