[
https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Payne updated NIFI-5533:
-----------------------------
Fix Version/s: 1.8.0
Status: Patch Available (was: Open)
> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
> Key: NIFI-5533
> URL: https://issues.apache.org/jira/browse/NIFI-5533
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.8.0
>
>
> Looking at the code, I see several places where we can reduce the heap that
> NiFi uses for FlowFiles:
> * When StandardPreparedQuery is used (any time Expression Language is
> evaluated), it creates a StringBuilder and iterates over all Expressions,
> evaluating them and concatenating the results together. If there is only a
> single Expression, though, we can avoid this and just return the value
> obtained from the Expression. While this will reduce the amount of garbage
> created, it plays a more important role: it avoids creating a new String
> object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go
> through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1
> million copies of that String on the heap. If we just returned the result of
> evaluating the Expression, we would instead have 1 copy of that String.
> * Similar to above, it may make sense in UpdateAttribute to cache up to N
> entries, so that when an expression like ${filename}.txt is evaluated,
> even though a new String is generated by StandardPreparedQuery, we can
> resolve that to the same String object when storing it as a FlowFile
> attribute. This would work similarly to {{String.intern()}} but would not use
> {{String.intern()}}, because we don't want to store an unbounded number of
> these values in the {{String.intern()}} cache. We want to cap the number of
> entries, in case the values aren't always reused.
> * Every FlowFile that is created by StandardProcessSession has a 'filename'
> attribute added. The value is obtained by calling
> {{String.valueOf(System.nanoTime())}}. This comes with a few downsides.
> Firstly, the system call is a bit expensive (though not bad). Secondly, the
> filename is not reliably unique: with many dataflows and concurrent tasks
> running, it's common for several FlowFiles to have 'naming collisions'. Most
> of all, though, it means that we are keeping an extra String on the heap for
> each FlowFile. A simple test showed that using the UUID as the default
> filename instead allowed 20% more FlowFiles to be generated on the same heap
> before running out of memory.
> * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap
> on every call. If we instead created a copy of it once, when the
> StandardProcessContext is created, we could just return that one Map
> every time, since it can't change over the lifetime of the ProcessContext.
> This is more about garbage collection and general processor performance than
> about heap utilization, but it is still in the same realm.
> I am sure that there are many more of these nuances, but these are certainly
> worth tackling.
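
The first two ideas in the description above can be sketched in plain Java
as follows. This is only an illustration under stated assumptions, not NiFi's
actual code: PreparedQuerySketch, BoundedStringCache, and HeapSketchDemo are
hypothetical names, an Expression is modeled as a java.util.function.Function
over an attribute map, null results are treated as empty strings, and the
cache evicts the least recently used entry, which is just one possible way to
cap the number of entries.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a prepared query: expressions evaluated against attributes. */
final class PreparedQuerySketch {
    private final List<Function<Map<String, String>, String>> expressions;

    PreparedQuerySketch(final List<Function<Map<String, String>, String>> expressions) {
        this.expressions = expressions;
    }

    String evaluate(final Map<String, String> attributes) {
        if (expressions.size() == 1) {
            // Fast path: hand back the expression's own result. No StringBuilder is
            // created and, for a plain attribute reference, no new String either.
            final String value = expressions.get(0).apply(attributes);
            return value == null ? "" : value;
        }
        // General path: concatenate all results, which always allocates a new String.
        final StringBuilder sb = new StringBuilder();
        for (final Function<Map<String, String>, String> expression : expressions) {
            final String value = expression.apply(attributes);
            if (value != null) {
                sb.append(value);
            }
        }
        return sb.toString();
    }
}

/** Hypothetical bounded alternative to String.intern(): holds at most maxEntries values. */
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that drops its least recently used entry
        // whenever an insertion pushes it past maxEntries.
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    /** Returns the cached instance equal to value, caching value itself if there is none. */
    String canonicalize(final String value) {
        if (value == null) {
            return null;
        }
        final String existing = cache.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }

    int size() {
        return cache.size();
    }
}

final class HeapSketchDemo {
    public static void main(final String[] args) {
        final Map<String, String> attributes = new HashMap<>();
        attributes.put("abc", "some value");
        attributes.put("filename", "data");

        // Models ${abc}: a single expression, so the attribute's String is returned as-is.
        final Function<Map<String, String>, String> abc = a -> a.get("abc");
        final PreparedQuerySketch copy = new PreparedQuerySketch(Collections.singletonList(abc));
        System.out.println("same instance as 'abc': "
            + (copy.evaluate(attributes) == attributes.get("abc")));

        // Models ${filename}.txt: two parts, so each evaluation builds a new String,
        // which the bounded cache then resolves to one shared instance.
        final Function<Map<String, String>, String> filename = a -> a.get("filename");
        final Function<Map<String, String>, String> suffix = a -> ".txt";
        final PreparedQuerySketch rename = new PreparedQuerySketch(Arrays.asList(filename, suffix));
        final BoundedStringCache cache = new BoundedStringCache(1000);
        final String first = cache.canonicalize(rename.evaluate(attributes));
        final String second = cache.canonicalize(rename.evaluate(attributes));
        System.out.println("shared after caching: " + (first == second));
    }
}
```

Unlike String.intern(), the cache in this sketch forgets values once the cap
is reached, so attribute values that are never reused cannot accumulate
without bound.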