Mark Payne created NIFI-5533:
--------------------------------
Summary: Improve efficiency of FlowFiles' heap usage
Key: NIFI-5533
URL: https://issues.apache.org/jira/browse/NIFI-5533
Project: Apache NiFi
Issue Type: Improvement
Reporter: Mark Payne
Assignee: Mark Payne
Looking at the code, I see several places where we can reduce the heap that
NiFi uses for FlowFiles:
* When StandardPreparedQuery is used (any time Expression Language is
evaluated), it creates a StringBuilder and iterates over all Expressions,
evaluating them and concatenating the results together. If there is only a
single Expression, though, we can avoid this and just return the value obtained
from the Expression. While this will reduce the amount of garbage created,
it plays a more important role: it avoids creating a new String object for the
FlowFile's attribute map. Currently, if 1 million FlowFiles go through
UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 million copies
of that String on the heap. If we just returned the result of evaluating the
Expression, we would instead have 1 copy of that String.
* Similar to the above, it may make sense in UpdateAttribute to cache up to N
entries, so that when an expression like ${filename}.txt is evaluated, even
though a new String is generated by StandardPreparedQuery, we can resolve that
to the same String object when storing it as a FlowFile attribute. This would
work similarly to {{String.intern()}} but would not use {{String.intern()}},
because we don't want to store an unbounded number of these values in the
{{String.intern()}} cache - we want to cap the number of entries, in case the
values aren't always reused.
* Every FlowFile that is created by StandardProcessSession has a 'filename'
attribute added. The value is obtained by calling
{{String.valueOf(System.nanoTime())}}. This comes with a few downsides.
Firstly, the system call is a bit expensive (though not bad). Secondly, the
filename is not reliably unique - with many dataflows and concurrent tasks
running, it's common to have several FlowFiles with 'naming collisions'. Most
of all, though, it means that we are keeping that String on the heap. A simple
test shows that using the UUID as the default filename instead allowed 20% more
FlowFiles to be generated on the same heap before running out of memory.
* {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap on
every call. If we instead created a copy of it once when the
StandardProcessContext was created, we could just return that one Map every
time, since it can't change over the lifetime of the ProcessContext. This is
more about garbage collection and general processor performance than about
heap utilization, but it is still in the same realm.
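
To illustrate the first item in the list, here is a minimal sketch of the single-Expression shortcut. It is not the actual StandardPreparedQuery code: the class name {{PreparedQuerySketch}} is hypothetical, and Expressions are modeled as plain {{Function}} objects over an attribute map purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a prepared query; not NiFi's real class.
final class PreparedQuerySketch {
    // Each "expression" is modeled as a function from attributes to a value (an assumption).
    private final List<Function<Map<String, String>, String>> expressions;

    PreparedQuerySketch(final List<Function<Map<String, String>, String>> expressions) {
        this.expressions = expressions;
    }

    String evaluate(final Map<String, String> attributes) {
        if (expressions.size() == 1) {
            // Single expression: return its result directly. No StringBuilder is
            // allocated and no new String is created, so the caller can store the
            // very same String instance that the expression produced.
            return expressions.get(0).apply(attributes);
        }

        // General case: concatenate the results of all expressions.
        final StringBuilder sb = new StringBuilder();
        for (final Function<Map<String, String>, String> expression : expressions) {
            final String value = expression.apply(attributes);
            if (value != null) {
                sb.append(value);
            }
        }
        return sb.toString();
    }
}
```

With a single expression that simply looks up 'abc', the value returned by {{evaluate}} is the same object as the one already in the attribute map, which is what would let 1 million FlowFiles share one String.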
I am sure that there are many more opportunities like these, but these are
certainly worth tackling.
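
For the capped cache suggested for UpdateAttribute above, one possible shape is a small LRU map that hands back a canonical String instance. This is only a sketch under assumptions: {{BoundedStringCache}} and {{canonicalize}} are hypothetical names, and the eviction policy (least recently used) is one choice among several.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NiFi: a size-capped alternative to String.intern().
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that drops the least recently used entry
        // once the cap is exceeded, so the cache cannot grow without bound.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached instance equal to 'value' if there is one;
    // otherwise caches 'value' itself and returns it.
    synchronized String canonicalize(final String value) {
        if (value == null) {
            return null;
        }
        final String existing = cache.get(value);
        if (existing != null) {
            return existing;
        }
        cache.put(value, value);
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

A processor could pass each newly evaluated attribute value through {{canonicalize}} before storing it, so that repeated values resolve to one String object while rarely reused values are eventually evicted.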