On Wed, Jan 7, 2015 at 10:54 AM, Joe Witt <[email protected]> wrote:
> Mike,
>
> We hope to discuss this in much more detail as we make progress toward the
> administration guide. But we are certainly susceptible to GC behaviors
> which can impact performance. That is particularly true because of the
> extension points which folks can build to (processors, controller tasks,
> etc.). We've taken great care to be as memory efficient as possible in
> all of our internal framework components and the existing standard
> processors. In short, everything is designed to handle arbitrarily large
> objects without ever loading more than some finite and relatively small
> amount of memory at once.
>

Yea, capturing all of this in user/operator facing documentation is
probably the best end-goal. I can file a JIRA if one does not already exist.

> Where this breaks down as we currently have it is the FlowFile objects
> themselves. For each flow file that is active in the flow we have the
> entire Map of attribute key/value String pairs loaded with the FlowFile
> object. So while we do not have the actual content of the flowfile in
> memory we do have those Maps and a few small values with each. If there
> are dozens or hundreds of large keys/values across hundreds of thousands if
> not many millions of flow files then that can start to eat into heap usage
> considerably. We do combat this fairly well with a concept called
> 'flowfile swapping'. If a queue backlogs beyond a configurable threshold
> we actually serialize the excess flowfiles out to storage (off heap). This
> allows for massive backlogs to be gracefully handled. But this mechanism
> is still arguably crude as it is purely based on 'number of flow files' and
> in reality there can be great variability in the "Heap cost" of any flow
> file and that depends on the number of and size of the attributes.
>

Are there metrics kept on flow file metadata?
I recall seeing # of flow files, but it would be cool to see summary
statistics on number of attributes, memory footprint per flow file, etc.
Apologies if this already exists, I haven't gone looking yet. Maybe JMX is
a good place for these.

>
> The key stressors of the heap:
> - Is it large enough for all the normal goings on in a flow?
> -- If yes great. If no then no matter what things will be no fun. The
> size needed depends on how many things are in the flow, how many flow files
> can be around at once, the sophistication of the processors in the flow.
>
> -- Are most objects created of a relatively short life span? If yes
> great. If not then it creates a different kind of tension on garbage
> collection. G1 tends to handle even this fairly well but still folks
> should strive to have objects as short lived as possible.
>
> -- Are all operations against content (which could be arbitrarily large)
> done so in a manner which only ever has some finite amount in the heap at a
> time? This is by far the single biggest gotcha we see related to garbage
> collection issues. It is imperative that if one wants to see their JVM
> stay performant that they be very cognizant of being buffer 'stream'
> friendly rather than using byte[] to hold large objects.
>

I could come up with several scenarios (i.e. do this or that) to ask about,
but I think I'll be better served by just looking at existing processors as
exemplars. I'll come back with more questions after I've read the source.

>
> I've run with G1 very successfully for a very long time and if I write the
> documentation for this I would recommend its use.

Good to know.

>
> I've put together a couple of 'Stress Test' style templates that people can
> run on their configured system to get a sense of memory load for well
> behaved processors and framework components. Hopefully that will help put
> some real information behind such a discussion.
> We can also update the GenerateFlowFile processor to have what would be
> considered bad behaviors so folks can plainly see the effects of bad
> memory practices.
>

This is very cool. I would make the bad behaviours optional, but otherwise
that is an incredibly clever idea. I love it.

>
> Was this rambling even close to what you were looking for?
>

Yes, very informative. Thank you.

>
> Thanks
> Joe
>
> On Wed, Jan 7, 2015 at 11:38 AM, Mike Drob <[email protected]> wrote:
>
> > Are there operational guidelines somewhere on heap sizing and garbage
> > collection when deploying NiFi?
> >
> > There's a lot of common wisdom about how to avoid full GCs (which I assume
> > are as bad for NiFi as they are for any Java application) but I was curious
> > what people had experience running with.
> >
> > CMS? G1? C4? Are there recommended options to enable/disable based on how
> > NiFi runs for a smoother experience?
> >
> > Mike
> >
>

Mike
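
P.S. To make the "stream friendly rather than byte[]" point concrete for
anyone reading the archive, here is a minimal sketch of what I understand it
to mean. It is plain java.io and not NiFi code: the class name BoundedCopy,
the copy method, and the 8 KB buffer size are all my own assumptions for
illustration. The idea is that only one small buffer lives on the heap no
matter how large the content is, as opposed to reading the whole thing into
a single byte[] first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper (not part of NiFi): moves content through a small fixed buffer. */
class BoundedCopy {
    // Assumed buffer size; the only per-copy heap cost that scales with nothing.
    static final int BUFFER_SIZE = 8192;

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        // read() returns -1 at end of stream; each pass reuses the same buffer.
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for real content here so the demo is
        // self-contained; with files or sockets the heap use stays at one buffer.
        byte[] sample = new byte[100_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(sample), sink);
        System.out.println("copied " + copied + " bytes");
    }
}
```

The "bad behavior" counterpart would be allocating a byte[] the size of the
whole content before writing any of it, which is presumably the kind of thing
an optional GenerateFlowFile setting could demonstrate.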
