[ 
https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596498#comment-16596498
 ] 

Mark Payne commented on NIFI-5533:
----------------------------------

The primary case that I was attempting to improve here is SplitText (or the 
Split* processors in general) splitting a single FlowFile into thousands of 
FlowFiles. I created a simple test flow: GenerateFlowFile (generating 100 lines 
of text) -> MergeRecord (to merge together X lines of text) -> SplitText 
(splitting to 1 line of text per FlowFile) -> UpdateAttribute (just to have a 
destination where we can see what is output).

Running this on the 'master' branch with a 1 GB heap (-Xmx1G) showed that if 
MergeRecord merged together 100,000 lines of text, SplitText could split it, 
but 250,000 lines failed with OutOfMemoryErrors. I forget the maximum number of 
lines that I successfully split, but I believe it was somewhere around 175,000 
to 200,000.

Running this branch, I was able to successfully merge and then split 750,000 
lines of text, so more than 3x as many FlowFiles could be produced in a single 
iteration. These changes were mostly isolated to the framework but did touch 
SplitText a bit: the processor was holding a List<FlowFileRecord> whose 
FlowFiles were stale, and it now holds the latest versions of the FlowFiles so 
that the stale ones can be removed by the garbage collector.
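
To illustrate the stale-reference point, here is a minimal sketch in plain Java. 
It is not the actual SplitText change: Snapshot and SplitTracker are 
hypothetical names, with Snapshot standing in for an immutable FlowFile record 
where every update yields a new object. Overwriting each list slot with the 
updated object means the list no longer pins the older versions on the heap.

```java
import java.util.List;

/** Hypothetical immutable stand-in for a FlowFile record (not a NiFi class). */
final class Snapshot {
    final long id;
    final String attribute;

    Snapshot(final long id, final String attribute) {
        this.id = id;
        this.attribute = attribute;
    }

    /** Returns a new object carrying the update; the receiver becomes the stale version. */
    Snapshot withAttribute(final String value) {
        return new Snapshot(id, value);
    }
}

/** Hypothetical helper showing how a processor-side list can avoid pinning stale versions. */
final class SplitTracker {
    /** Overwrites each slot with the updated object so the list holds only the latest versions. */
    static void updateAll(final List<Snapshot> splits, final String value) {
        for (int i = 0; i < splits.size(); i++) {
            splits.set(i, splits.get(i).withAttribute(value));
        }
    }
}
```

If the list kept the original objects while the updated ones were tracked 
elsewhere, both generations would stay reachable until the list was released.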

> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
>                 Key: NIFI-5533
>                 URL: https://issues.apache.org/jira/browse/NIFI-5533
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Looking at the code, I see several places where we can reduce the heap that 
> NiFi uses for FlowFiles:
>  * When StandardPreparedQuery is used (any time Expression Language is 
> evaluated), it creates a StringBuilder and iterates over all Expressions, 
> evaluating them and concatenating the results together. If there is only a 
> single Expression, though, we can avoid this and just return the value 
> obtained from the Expression. While this will improve the amount of garbage 
> collected, it plays a more important role: it avoids creating a new String 
> object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go 
> through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 
> million copies of that String on the heap. If we just returned the result of 
> evaluating the Expression, we would instead have 1 copy of that String.
>  * Similarly to the above, it may make sense in UpdateAttribute to cache N 
> entries, so that when an expression like ${filename}.txt is evaluated, 
> even though a new String is generated by StandardPreparedQuery, we can 
> resolve that to the same String object when storing as a FlowFile attribute. 
> This would work similarly to {{String.intern()}} but not use 
> {{String.intern()}} because we don't want to store an unbounded number of 
> these values in the {{String.intern()}} cache - we want to cap the number of 
> entries, in case the values aren't always reused.
>  * Every FlowFile that is created by StandardProcessSession has a 'filename' 
> attribute added. The value is obtained by calling 
> {{String.valueOf(System.nanoTime());}} This comes with a few downsides. 
> Firstly, the system call is a bit expensive (though not bad). Secondly, the 
> filename is not reliably unique - with many dataflows and concurrent tasks 
> running, it's common to have several FlowFiles with 'naming collisions'. Most 
> of all, though, it means that we are keeping that String on the heap. A 
> simple test shows that using the UUID as the default filename instead allowed 
> 20% more FlowFiles to be generated on the same heap before running out of 
> heap.
>  * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap 
> for every call. If we instead created a copy of it once when the 
> StandardProcessContext was created, we could instead just return that one Map 
> every time, since it can't change over the lifetime of the ProcessContext. 
> This is more about garbage collection and general processor performance than 
> about heap utilization but still in the same realm.
> I am sure that there are far more of these nuances but these are certainly 
> worth tackling.
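
The first two items in the list above can be sketched in plain Java. This is 
only an illustration under stated assumptions, not the NiFi implementation: 
SimplePreparedQuery and BoundedStringCache are hypothetical names, each 
Supplier stands in for one Expression, and the LRU eviction policy is an 
assumption (the issue only asks for a cap on the number of entries).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a prepared query; each Supplier plays the role of one Expression. */
final class SimplePreparedQuery {
    private final List<Supplier<String>> expressions;

    SimplePreparedQuery(final List<Supplier<String>> expressions) {
        this.expressions = expressions;
    }

    String evaluate() {
        // Fast path: with a single expression, return its result as-is (no StringBuilder, no copy).
        if (expressions.size() == 1) {
            return expressions.get(0).get();
        }
        final StringBuilder sb = new StringBuilder();
        for (final Supplier<String> expression : expressions) {
            sb.append(expression.get());
        }
        return sb.toString();
    }
}

/** Hypothetical size-capped alternative to String.intern(); LRU eviction is an assumption. */
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that drops the least recently used entry once over the cap.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached instance equal to value, or caches and returns value on a miss. */
    synchronized String canonicalize(final String value) {
        if (value == null) {
            return null;
        }
        final String existing = cache.get(value);
        if (existing != null) {
            return existing;
        }
        cache.put(value, value);
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With the fast path, copying 'abc' to 'xyz' hands back the String that the 
single expression resolved to rather than a fresh copy. For an expression like 
${filename}.txt, which always builds a new String, passing the result through 
canonicalize() before storing it lets equal values share one object while the 
cap keeps the cache from growing without bound.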



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
