Mark Payne created NIFI-5533:
--------------------------------

             Summary: Improve efficiency of FlowFiles' heap usage
                 Key: NIFI-5533
                 URL: https://issues.apache.org/jira/browse/NIFI-5533
             Project: Apache NiFi
          Issue Type: Improvement
            Reporter: Mark Payne
            Assignee: Mark Payne


Looking at the code, I see several places where we can reduce the heap that 
NiFi uses for FlowFiles:
 * When StandardPreparedQuery is used (any time Expression Language is 
evaluated), it creates a StringBuilder and iterates over all Expressions, 
evaluating them and concatenating the results together. If there is only a 
single Expression, though, we can avoid this and just return the value obtained 
from the Expression. While this will reduce the amount of garbage created, 
it has a more important benefit: it avoids creating a new String object for the 
FlowFile's attribute map. Currently, if 1 million FlowFiles go through 
UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 million copies 
of that String on the heap. If we just returned the result of evaluating the 
Expression, we would instead have 1 copy of that String.
 * Similar to the above, it may make sense for UpdateAttribute to cache up to N 
entries, so that when an expression like ${filename}.txt is evaluated, even 
though a new String is generated by StandardPreparedQuery, we can resolve it 
to the same String object when storing it as a FlowFile attribute. This would 
work similarly to {{String.intern()}}, but we would not use 
{{String.intern()}} itself because we don't want to store an unbounded number 
of these values in the {{String.intern()}} cache - we want to cap the number 
of entries, in case the values aren't always reused.
 * Every FlowFile that is created by StandardProcessSession has a 'filename' 
attribute added. The value is obtained by calling 
{{String.valueOf(System.nanoTime())}}. This comes with a few downsides. 
Firstly, the system call is somewhat expensive (though not prohibitively so). 
Secondly, the filename is not reliably unique - with many dataflows and 
concurrent tasks running, it is common for several FlowFiles to have colliding 
names. Most of all, though, it means that we keep an extra String on the heap 
for every FlowFile. A simple test shows that using the UUID as the default 
filename instead allowed 20% more FlowFiles to be generated before the same 
heap was exhausted.
 * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap on 
every call. If we instead created one copy when the StandardProcessContext is 
created, we could return that same Map every time, since it can't change over 
the lifetime of the ProcessContext. This is more about garbage collection and 
general processor performance than about heap utilization, but it is in the 
same realm.
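
To illustrate the capped cache described in the second item, a minimal sketch 
might look like the following. The class BoundedStringCache and its methods 
are hypothetical and are not existing NiFi code; the sketch assumes a simple 
least-recently-used eviction policy built on java.util.LinkedHashMap, which is 
only one of several reasonable choices.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of NiFi): a size-capped alternative to
// String.intern() that maps equal Strings to one shared instance.
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently
        // used, so the "eldest" entry is the one that was used longest ago.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                // Evict once the cap is exceeded, keeping the cache bounded.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached instance equal to 'value' if one exists;
    // otherwise caches 'value' and returns it unchanged.
    synchronized String canonicalize(final String value) {
        if (value == null) {
            return null;
        }
        final String existing = cache.get(value);
        if (existing != null) {
            return existing;
        }
        cache.put(value, value);
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With something like this, an attribute value would be passed through 
canonicalize() before being stored, so that equal values share one String 
object as long as they remain in the cache. Values that are never reused are 
eventually evicted rather than retained indefinitely.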

I am sure that there are many more opportunities like these, but these are 
certainly worth tackling.
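
As a rough illustration of the first item (skipping the StringBuilder when 
there is only a single Expression), the idea could be sketched as below. The 
names SketchExpression and PreparedQuerySketch are hypothetical stand-ins and 
do not reflect the actual StandardPreparedQuery code; the point is only that 
the single-Expression path returns the evaluated String itself rather than a 
copy.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an Expression Language expression.
interface SketchExpression {
    String evaluate(Map<String, String> attributes);
}

// Hypothetical stand-in for a prepared query; not the real NiFi class.
final class PreparedQuerySketch {
    private final List<SketchExpression> expressions;

    PreparedQuerySketch(final List<SketchExpression> expressions) {
        this.expressions = expressions;
    }

    String evaluate(final Map<String, String> attributes) {
        // Fast path: with a single Expression, return its result directly.
        // No StringBuilder is allocated and no new String is created, so
        // the caller gets the same object the Expression produced.
        if (expressions.size() == 1) {
            return expressions.get(0).evaluate(attributes);
        }

        // General path: concatenate the results of all Expressions.
        final StringBuilder sb = new StringBuilder();
        for (final SketchExpression expression : expressions) {
            final String value = expression.evaluate(attributes);
            if (value != null) {
                sb.append(value);
            }
        }
        return sb.toString();
    }
}
```

In this sketch, an expression that simply looks up the 'abc' attribute would 
hand back the very String already held in the attribute map, so copying it to 
'xyz' would add a reference rather than a second String.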



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
