Mike,

We hope to discuss this in much more detail as we make progress toward the administration guide, but we are certainly susceptible to GC behaviors which can impact performance. That is particularly true because of the extension points folks can build to (processors, controller tasks, etc.). We've taken great care to be as memory efficient as possible in all of our internal framework components and the existing standard processors. In short, everything is designed to handle arbitrarily large objects without ever loading more than some finite and relatively small amount of memory at once.
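To make that concrete, here is a minimal, framework-agnostic sketch in plain java.io (not our actual processor API, and the 8 KB buffer size is just an arbitrary illustration) contrasting the byte[] habit with a stream-friendly copy:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class StreamingExample {

        // Anti-pattern: the entire content ends up on the heap at once.
        // A 1 GB payload means at least 1 GB of heap held for the life of
        // this array, which is exactly the kind of pressure that hurts GC.
        static byte[] readFullyIntoHeap(InputStream in) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int len;
            while ((len = in.read(chunk)) != -1) {
                bos.write(chunk, 0, len);
            }
            return bos.toByteArray(); // whole payload now lives on the heap
        }

        // Stream friendly: no matter how large the content is, only the
        // small fixed-size buffer is ever resident in the heap at one time.
        static void copyStreaming(InputStream in, OutputStream out) throws IOException {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
            out.flush();
        }
    }

The same idea applies inside a processor: wire the input stream straight through to the output stream (or through whatever transformation is being done) rather than materializing the whole payload.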
Where this breaks down as we currently have it is the FlowFile objects themselves. For each flow file that is active in the flow we have the entire Map of attribute key/value String pairs loaded with the FlowFile object. So while we do not have the actual content of the flow file in memory, we do have those Maps and a few small values with each. If there are dozens or hundreds of large keys/values across hundreds of thousands, if not many millions, of flow files, that can start to eat into heap usage considerably.

We combat this fairly well with a concept called 'flowfile swapping'. If a queue backlogs beyond a configurable threshold we serialize the excess flow files out to storage (off heap), which allows massive backlogs to be handled gracefully. But this mechanism is still arguably crude: it is based purely on the number of flow files, and in reality the 'heap cost' of any one flow file can vary greatly depending on the number and size of its attributes.

The key stressors of the heap:

- Is the heap large enough for all the normal goings-on in a flow?
-- If yes, great. If no, then no matter what, things will be no fun. The size needed depends on how many things are in the flow, how many flow files can be around at once, and the sophistication of the processors in the flow.

- Are most objects created with a relatively short life span?
-- If yes, great. If not, it creates a different kind of tension on garbage collection. G1 tends to handle even this fairly well, but folks should still strive to keep objects as short-lived as possible.

- Are all operations against content (which could be arbitrarily large) done in a manner that only ever has some finite amount in the heap at a time?
-- This is by far the single biggest gotcha we see related to garbage collection issues. It is imperative that anyone who wants their JVM to stay performant be very cognizant of being buffer/'stream' friendly rather than using byte[] to hold large objects.

I've run with G1 very successfully for a very long time, and if I write the documentation for this I would recommend its use (a rough sketch of where the relevant settings live is below, after the quoted message). I've put together a couple of 'stress test' style templates that people can run on their configured system to get a sense of the memory load for well-behaved processors and framework components. Hopefully that will help put some real information behind such a discussion. We can also update the GenerateFlowFile processor to have what would be considered bad behaviors so folks can plainly see the effects of bad memory practices.

Was this rambling even close to what you were looking for?

Thanks
Joe

On Wed, Jan 7, 2015 at 11:38 AM, Mike Drob <[email protected]> wrote:
> Are there operational guidelines somewhere on heap sizing and garbage
> collection when deploying NiFi?
>
> There's a lot of common wisdom about how to avoid full GCs (which I assume
> are as bad for NiFi as they are for any Java application) but I was curious
> what people had experience running with.
>
> CMS? G1? C4? Are there recommended options to enable/disable based on how
> NiFi runs for a smoother experience?
>
> Mike
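As promised above, a rough sketch of where the knobs mentioned earlier live. The property and argument names here are from memory and worth double-checking against your build, and the heap values are purely illustrative:

    # nifi.properties -- how many flow files a single queue will hold in heap
    # before the excess is serialized out to disk ('flowfile swapping')
    nifi.queue.swap.threshold=20000

    # conf/bootstrap.conf -- heap sizing and collector selection
    # (the java.arg.N numbers just need to be unique within the file)
    java.arg.2=-Xms4g
    java.arg.3=-Xmx4g
    java.arg.13=-XX:+UseG1GC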
