[ 
https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-5533:
-----------------------------
    Fix Version/s: 1.8.0
           Status: Patch Available  (was: Open)

> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
>                 Key: NIFI-5533
>                 URL: https://issues.apache.org/jira/browse/NIFI-5533
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Looking at the code, I see several places where we can reduce the heap that 
> NiFi uses for FlowFiles:
>  * When StandardPreparedQuery is used (any time Expression Language is 
> evaluated), it creates a StringBuilder and iterates over all Expressions, 
> evaluating them and concatenating the results together. If there is only a 
> single Expression, though, we can avoid this and just return the value 
> obtained from the Expression. While this will reduce the amount of garbage 
> generated, it plays a more important role: it avoids creating a new String 
> object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go 
> through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 
> million copies of that String on the heap. If we just returned the result of 
> evaluating the Expression, we would instead have 1 copy of that String.
>  * Similar to above, it may make sense in UpdateAttribute to cache up to N 
> entries, so that when an expression like ${filename}.txt is evaluated, 
> even though a new String is generated by StandardPreparedQuery, we can 
> resolve that to the same String object when storing as a FlowFile attribute. 
> This would work similarly to {{String.intern()}} but would not use 
> {{String.intern()}} itself, because we don't want to store an unbounded 
> number of these values in the {{String.intern()}} cache. We want to cap the 
> number of entries, in case the values aren't always reused.
>  * Every FlowFile that is created by StandardProcessSession has a 'filename' 
> attribute added. The value is obtained by calling 
> {{String.valueOf(System.nanoTime())}}. This comes with a few downsides. 
> Firstly, the system call is a bit expensive (though not bad). Secondly, the 
> filename is not reliably unique: with many dataflows and concurrent tasks 
> running, it is common for several FlowFiles to have 'naming collisions'. Most 
> of all, though, it means that we are keeping that String on the heap. A simple 
> test shows that using the UUID as the default filename instead allowed 20% 
> more FlowFiles to be generated on the same heap before running out of 
> memory.
>  * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap 
> on every call. If we instead created the copy once, when the 
> StandardProcessContext is created, we could just return that one Map 
> every time, since it can't change over the lifetime of the ProcessContext. 
> This is more about garbage collection and general processor performance than 
> about heap utilization, but it is still in the same realm.
> I am sure that there are far more of these nuances but these are certainly 
> worth tackling.
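
As a rough illustration of the first two points in the description above, here is a minimal Java sketch. It is not taken from the NiFi code base: the class FlowFileHeapSketch, the BoundedInterner type, and the evaluate() helper are all hypothetical names, and least-recently-used eviction is an assumption, since the description only asks that the number of entries be capped. Expressions are modeled as plain Supplier<String> instances rather than NiFi's own Expression Language types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the NiFi code base.
class FlowFileHeapSketch {

    // A capped alternative to String.intern(): equal Strings resolve to one
    // shared instance, but only up to maxEntries values are retained.
    static final class BoundedInterner {
        private final Map<String, String> cache;

        BoundedInterner(final int maxEntries) {
            // accessOrder=true keeps entries ordered least-recently-used first,
            // so removeEldestEntry evicts the LRU value (an assumed policy).
            this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        // Returns the cached instance equal to value, caching value if absent.
        synchronized String intern(final String value) {
            final String existing = cache.get(value);
            if (existing != null) {
                return existing;
            }
            cache.put(value, value);
            return value;
        }

        synchronized int size() {
            return cache.size();
        }
    }

    // Single-expression fast path: return the evaluated value directly
    // instead of copying it through a StringBuilder into a new String.
    static String evaluate(final List<Supplier<String>> expressions) {
        if (expressions.size() == 1) {
            return expressions.get(0).get();
        }
        final StringBuilder sb = new StringBuilder();
        for (final Supplier<String> expression : expressions) {
            sb.append(expression.get());
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final String abc = new String("abc");
        final Supplier<String> attribute = () -> abc;
        final Supplier<String> suffix = () -> ".txt";

        // One expression: the very same String object comes back.
        System.out.println(evaluate(Collections.singletonList(attribute)) == abc);

        // Two expressions: a new String is built each time, but the bounded
        // interner resolves equal results to a single shared instance.
        final BoundedInterner interner = new BoundedInterner(1000);
        final String first = interner.intern(evaluate(Arrays.asList(attribute, suffix)));
        final String second = interner.intern(evaluate(Arrays.asList(attribute, suffix)));
        System.out.println(first == second);
    }
}
```

Both checks in main() print true. A real implementation would also have to consider lock contention across concurrent tasks, which this sketch sidesteps with plain synchronized methods.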



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
