Mike,

We hope to discuss this in much more detail as we make progress toward the administration guide, but we are certainly susceptible to GC behaviors which can impact performance. That is particularly true because of the extension points folks can build to (processors, controller tasks, etc.). We've taken great care to be as memory efficient as possible in all of our internal framework components and the existing standard processors. In short, everything is designed to handle arbitrarily large objects without ever loading more than some finite and relatively small amount of memory at once.
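To make that concrete, here is a minimal, framework-agnostic sketch in plain java.io (not our actual processor API, and the 8 KB buffer size is just an arbitrary illustration) contrasting the byte[] habit with a stream-friendly copy:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class StreamingExample {

        // Anti-pattern: the entire content ends up on the heap at once.
        // A 1 GB payload means at least 1 GB of heap held for the life of
        // this array, which is exactly the kind of pressure that hurts GC.
        static byte[] readFullyIntoHeap(InputStream in) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int len;
            while ((len = in.read(chunk)) != -1) {
                bos.write(chunk, 0, len);
            }
            return bos.toByteArray(); // whole payload now lives on the heap
        }

        // Stream friendly: no matter how large the content is, only the
        // small fixed-size buffer is ever resident in the heap at one time.
        static void copyStreaming(InputStream in, OutputStream out) throws IOException {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
            out.flush();
        }
    }

The same idea applies inside a processor: wire the input stream straight through to the output stream (or through whatever transformation is being done) rather than materializing the whole payload.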
Where this breaks down as we currently have it is the FlowFile objects themselves. For each flow file that is active in the flow we have the entire Map of attribute key/value String pairs loaded with the FlowFile object. So while we do not have the actual content of the flow file in memory, we do have those Maps and a few small values with each. If there are dozens or hundreds of large keys/values across hundreds of thousands, if not many millions, of flow files, that can start to eat into heap usage considerably.

We combat this fairly well with a concept called 'flowfile swapping'. If a queue backlogs beyond a configurable threshold we serialize the excess flow files out to storage (off heap), which allows massive backlogs to be handled gracefully. But this mechanism is still arguably crude: it is based purely on the number of flow files, and in reality the 'heap cost' of any one flow file can vary greatly depending on the number and size of its attributes.

The key stressors of the heap:

- Is the heap large enough for all the normal goings-on in a flow?
-- If yes, great. If no, then no matter what, things will be no fun. The size needed depends on how many things are in the flow, how many flow files can be around at once, and the sophistication of the processors in the flow.

- Are most objects created with a relatively short life span?
-- If yes, great. If not, it creates a different kind of tension on garbage collection. G1 tends to handle even this fairly well, but folks should still strive to keep objects as short-lived as possible.

- Are all operations against content (which could be arbitrarily large) done in a manner that only ever has some finite amount in the heap at a time?
-- This is by far the single biggest gotcha we see related to garbage collection issues. It is imperative that anyone who wants their JVM to stay performant be very cognizant of being buffer/'stream' friendly rather than using byte[] to hold large objects.

I've run with G1 very successfully for a very long time, and if I write the documentation for this I would recommend its use (a rough sketch of where the relevant settings live is below, after the quoted message). I've put together a couple of 'stress test' style templates that people can run on their configured system to get a sense of the memory load for well-behaved processors and framework components. Hopefully that will help put some real information behind such a discussion. We can also update the GenerateFlowFile processor to have what would be considered bad behaviors so folks can plainly see the effects of bad memory practices.

Was this rambling even close to what you were looking for?

Thanks
Joe

On Wed, Jan 7, 2015 at 11:38 AM, Mike Drob <[email protected]> wrote:
> Are there operational guidelines somewhere on heap sizing and garbage
> collection when deploying NiFi?
>
> There's a lot of common wisdom about how to avoid full GCs (which I assume
> are as bad for NiFi as they are for any Java application) but I was curious
> what people had experience running with.
>
> CMS? G1? C4? Are there recommended options to enable/disable based on how
> NiFi runs for a smoother experience?
>
> Mike
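As promised above, a rough sketch of where the knobs mentioned earlier live. The property and argument names here are from memory and worth double-checking against your build, and the heap values are purely illustrative:

    # nifi.properties -- how many flow files a single queue will hold in heap
    # before the excess is serialized out to disk ('flowfile swapping')
    nifi.queue.swap.threshold=20000

    # conf/bootstrap.conf -- heap sizing and collector selection
    # (the java.arg.N numbers just need to be unique within the file)
    java.arg.2=-Xms4g
    java.arg.3=-Xmx4g
    java.arg.13=-XX:+UseG1GC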
