Mark Payne created NIFI-3648:
--------------------------------

             Summary: Address Excessive Garbage Collection
                 Key: NIFI-3648
                 URL: https://issues.apache.org/jira/browse/NIFI-3648
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework, Extensions
            Reporter: Mark Payne
            Assignee: Mark Payne


We have a lot of places in the codebase where we generate unnecessary 
garbage - especially byte arrays. We need to clean this up in order to 
relieve stress on the garbage collector.

Specific points that I've found create unnecessary garbage:

The Provenance CompressableRecordWriter creates a new BufferedOutputStream for 
each 'compression block' that it writes, and each one allocates a 64 KB byte[]. 
This is very wasteful. We should instead subclass BufferedOutputStream so that 
we can provide a byte[] to use, rather than an int that indicates the size. 
That way we can keep re-using the same byte[] for the lifetime of each writer. 
This saves roughly 32,000 of these 64 KB byte[] allocations per writer - and we 
create more than one writer per minute.
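A sketch of that approach (class and method names here are my assumptions, not existing NiFi code): BufferedOutputStream's internal buf field is protected, so a subclass can swap in a caller-supplied array that the writer allocates once and reuses for every compression block.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ReuseDemo {
    // Hypothetical subclass: BufferedOutputStream's "buf" field is protected,
    // so we can substitute a caller-supplied, reusable buffer.
    static class ReusableBufferedOutputStream extends BufferedOutputStream {
        ReusableBufferedOutputStream(final OutputStream out, final byte[] buffer) {
            super(out, 1);      // superclass allocates the minimum it allows...
            this.buf = buffer;  // ...then we swap in the shared 64 KB buffer
        }
    }

    public static void main(final String[] args) throws IOException {
        final byte[] shared = new byte[64 * 1024];   // allocated once per writer
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new ReusableBufferedOutputStream(sink, shared)) {
            out.write("hello".getBytes("UTF-8"));
        }
        System.out.println(sink.toString("UTF-8")); // prints "hello"
    }
}
```

The same `shared` array would then be handed to each new ReusableBufferedOutputStream as compression blocks are opened, instead of letting each one allocate its own 64 KB.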

EvaluateJsonPath wraps its input in a BufferedInputStream, but this is 
unnecessary because the underlying library buffers the data itself. As a 
result, we are allocating a lot of byte[] arrays for no benefit.

CompressContent wraps both its input and its output in buffered streams, each 
with a 64 KB byte[], yet it needs neither: it already reads and writes through 
its own byte[] buffer via StreamUtils.copy.
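To illustrate why the wrappers are redundant, here is a generic copy loop in the style of StreamUtils.copy (this is a sketch, not NiFi's actual StreamUtils source): the loop already moves data through its own byte[], so buffered streams on either side only add more arrays.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyDemo {
    // Copies all bytes from in to out through a single local buffer,
    // returning the number of bytes copied. Because this loop buffers
    // on its own, Buffered stream wrappers around in/out add nothing.
    static long copy(final InputStream in, final OutputStream out) throws IOException {
        final byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] payload = "some flowfile content".getBytes("UTF-8");
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final long copied = copy(new ByteArrayInputStream(payload), out);
        System.out.println(copied); // prints 21
    }
}
```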

Site-to-site uses CompressionInputStream. This stream allocates a new byte[] in 
its readChunkHeader() method for every chunk. We should instead create a new 
byte[] only when we need a bigger buffer, and otherwise track the valid region 
with offset and length variables.
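A sketch of that strategy (class, field, and method names are my assumptions, not the actual site-to-site code): the reader keeps one buffer across chunks, grows it only when a chunk exceeds the current capacity, and uses a length field to mark the valid region.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ChunkReader {
    private byte[] buffer = new byte[0]; // reused across chunks
    private int chunkLength;             // valid bytes in buffer for this chunk

    // Reads one length-prefixed chunk, growing the buffer only when the
    // incoming chunk exceeds the current capacity.
    void readChunk(final DataInputStream in) throws IOException {
        chunkLength = in.readInt();
        if (buffer.length < chunkLength) {
            buffer = new byte[chunkLength];
        }
        in.readFully(buffer, 0, chunkLength); // bytes past chunkLength are stale
    }

    int getChunkLength() {
        return chunkLength;
    }

    public static void main(final String[] args) throws IOException {
        // Build two length-prefixed chunks and read both with one buffer.
        final ByteArrayOutputStream raw = new ByteArrayOutputStream();
        final DataOutputStream dos = new DataOutputStream(raw);
        for (final String s : new String[] {"first chunk", "2nd"}) {
            final byte[] b = s.getBytes("UTF-8");
            dos.writeInt(b.length);
            dos.write(b);
        }
        final ChunkReader reader = new ChunkReader();
        final DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(raw.toByteArray()));
        reader.readChunk(in);
        reader.readChunk(in); // second chunk fits; no new allocation
        System.out.println(new String(reader.buffer, 0, reader.chunkLength, "UTF-8")); // prints "2nd"
    }
}
```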

Right now, SplitText uses TextLineDemarcator. The fill() method grows the 
internal byte[] by 8 KB each time. When dealing with a large chunk of data, 
this is VERY expensive on GC, because we continually create a byte[] and then 
discard it to create a new one. Take for example an 800 KB chunk: we would 
reallocate about 100 times. If we instead doubled the size each time, we would 
only have to create about 8 of these.
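A sketch of the doubling strategy (class name is an assumption; TextLineDemarcator's real fill() does more than this, but the growth policy is the point):

```java
import java.util.Arrays;

public class DoublingBuffer {
    private byte[] data = new byte[8192]; // start at 8 KB, like fill()

    // Grows by doubling until the buffer can hold "required" bytes,
    // preserving existing content. Returns the resulting capacity.
    int ensureCapacity(final int required) {
        int newSize = data.length;
        while (newSize < required) {
            newSize *= 2; // double instead of adding a fixed 8 KB
        }
        if (newSize != data.length) {
            data = Arrays.copyOf(data, newSize);
        }
        return newSize;
    }

    public static void main(final String[] args) {
        // An 800 KB chunk is reached in 7 doublings (8 KB -> 1024 KB),
        // instead of ~100 fixed 8 KB growth steps.
        System.out.println(new DoublingBuffer().ensureCapacity(800 * 1024)); // prints 1048576
    }
}
```

Doubling trades a modest amount of over-allocation (at most 2x the needed size) for a logarithmic number of allocations and copies.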

Other Processors that use Buffered streams unnecessarily:

ConvertJSONToSQL
ExecuteProcess
ExecuteStreamCommand
AttributesToJSON
EvaluateJsonPath (when writing to content)
ExtractGrok
JmsConsumer




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
