[ 
https://issues.apache.org/jira/browse/NIFI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596487#comment-16596487
 ] 

ASF GitHub Bot commented on NIFI-5533:
--------------------------------------

GitHub user markap14 opened a pull request:

    https://github.com/apache/nifi/pull/2974

    NIFI-5533: Be more efficient with heap utilization

    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [ ] Is there a JIRA ticket associated with this PR? Is it referenced
          in the commit message?
    
    - [ ] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number
          you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [ ] Has your PR been rebased against the latest commit within the target
          branch (typically master)?
    
    - [ ] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [ ] Have you ensured that the full suite of tests is executed via
          mvn -Pcontrib-check clean install at the root nifi folder?
    - [ ] Have you written or updated unit tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies
          licensed in a way that is compatible for inclusion under
          [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
    - [ ] If applicable, have you updated the LICENSE file, including the main
          LICENSE file under nifi-assembly?
    - [ ] If applicable, have you updated the NOTICE file, including the main
          NOTICE file found under nifi-assembly?
    - [ ] If adding new Properties, have you added .displayName in addition to
          .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in
          which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build
    issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markap14/nifi NIFI-5533

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2974.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2974
    
----
commit 35297268ef90e83385c7487b17100e41ccc22340
Author: Mark Payne <markap14@...>
Date:   2018-08-17T18:08:14Z

    NIFI-5533: Be more efficient with heap utilization
     - Updated FlowFile Repo / Write Ahead Log so that any update that writes 
more than 1 MB of data is written to a file inside the FlowFile Repo rather 
than being buffered in memory
     - Updated SplitText so that it does not hold FlowFiles that are not the 
latest version in heap. Holding them prevents them from being garbage 
collected, so while the Process Session is holding the latest version of the 
FlowFile, SplitText is holding an older version, and this results in two 
copies of the same FlowFile object

commit 45d3ab2ad166b5df573adbfab2f5e8be9efc8705
Author: Mark Payne <markap14@...>
Date:   2018-08-23T16:58:59Z

    NIFI-5533: Checkpoint

commit bd866a1652e671a62f71233ced160f0a215f7a5f
Author: Mark Payne <markap14@...>
Date:   2018-08-23T17:55:25Z

    NIFI-5533: Bug Fixes

----
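
The first commit describes writing large updates to a file instead of buffering them in memory. The following is only a minimal sketch of that general idea, not the code in the pull request: the class name `SpillingOutputStream`, its constructor parameters, and the choice of overflow file are all hypothetical. It buffers on the heap until a byte threshold is crossed, then moves the buffered bytes to a file and releases the heap copy.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: buffers in memory up to a threshold, then spills to a file. */
final class SpillingOutputStream extends OutputStream {
    private final int thresholdBytes;   // e.g. 1024 * 1024 for a 1 MB limit
    private final Path overflowFile;    // assumed location chosen by the caller
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream file;          // non-null only after spilling
    private long written;

    SpillingOutputStream(int thresholdBytes, Path overflowFile) {
        this.thresholdBytes = thresholdBytes;
        this.overflowFile = overflowFile;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        if (file == null && written + len > thresholdBytes) {
            // Crossing the threshold: copy what is buffered to disk and drop the heap copy.
            file = Files.newOutputStream(overflowFile);
            memory.writeTo(file);
            memory = null;
        }
        (file != null ? file : memory).write(buf, off, len);
        written += len;
    }

    boolean spilledToFile() {
        return file != null;
    }

    long size() {
        return written;
    }

    @Override
    public void close() throws IOException {
        if (file != null) {
            file.close();
        }
    }
}
```

With a threshold of 1 MB, small updates never touch the overflow file, while a large update holds at most about 1 MB on the heap before it is moved to disk.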


> Improve efficiency of FlowFiles' heap usage
> -------------------------------------------
>
>                 Key: NIFI-5533
>                 URL: https://issues.apache.org/jira/browse/NIFI-5533
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Looking at the code, I see several places that we can improve the heap that 
> NiFi uses for FlowFiles:
>  * When StandardPreparedQuery is used (any time Expression Language is 
> evaluated), it creates a StringBuilder and iterates over all Expressions, 
> evaluating them and concatenating the results together. If there is only a 
> single Expression, though, we can avoid this and just return the value 
> obtained from the Expression. While this will reduce the amount of garbage 
> created, it plays a more important role: it avoids creating a new String 
> object for the FlowFile's attribute map. Currently, if 1 million FlowFiles go 
> through UpdateAttribute to copy the 'abc' attribute to 'xyz', we have 1 
> million copies of that String on the heap. If we just returned the result of 
> evaluating the Expression, we would instead have 1 copy of that String.
>  * Similar to above, it may make sense in UpdateAttribute to cache N number 
> of entries, so that when an expression like ${filename}.txt is evaluated, 
> even though a new String is generated by StandardPreparedQuery, we can 
> resolve that to the same String object when storing as a FlowFile attribute. 
> This would work similarly to {{String.intern()}} but would not use 
> {{String.intern()}}, because we don't want to store an unbounded number of 
> these values in the {{String.intern()}} cache - we want to cap the number of 
> entries, in case the values aren't always reused.
>  * Every FlowFile that is created by StandardProcessSession has a 'filename' 
> attribute added. The value is obtained by calling 
> {{String.valueOf(System.nanoTime());}} This comes with a few downsides. 
> Firstly, the system call is a bit expensive (though not bad). Secondly, the 
> filename is not reliably unique - with many dataflows and concurrent tasks 
> running, it is common for several FlowFiles to have 'naming collisions'. Most 
> of all, though, it means that we are keeping that String on the heap. A 
> simple test showed that using the UUID as the default filename instead 
> allowed 20% more FlowFiles to be generated on the same heap before running 
> out of memory.
>  * {{AbstractComponentNode.getProperties()}} creates a copy of its HashMap 
> for every call. If we instead created a copy of it once when the 
> StandardProcessContext was created, we could instead just return that one Map 
> every time, since it can't change over the lifetime of the ProcessContext. 
> This is more about garbage collection and general processor performance than 
> about heap utilization but still in the same realm.
> I am sure that there are far more of these nuances but these are certainly 
> worth tackling.
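
The first two ideas in the description above can be sketched roughly as follows. This is an illustration under stated assumptions, not NiFi's implementation: `AttributeValueSketch`, `BoundedStringCache`, and `canonicalize` are hypothetical names, expressions are modeled as plain `Supplier<String>` values, and the eviction policy (least recently used, via `LinkedHashMap`) is one possible choice that the description does not specify.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for evaluating a list of expressions into one value. */
final class AttributeValueSketch {
    static String evaluate(List<Supplier<String>> expressions) {
        if (expressions.size() == 1) {
            // Single expression: return its result as-is, so no StringBuilder
            // and no new String copy are created.
            return expressions.get(0).get();
        }
        StringBuilder sb = new StringBuilder();
        for (Supplier<String> expression : expressions) {
            sb.append(expression.get());
        }
        return sb.toString();
    }
}

/** Hypothetical bounded alternative to String.intern(): at most maxEntries values are kept. */
final class BoundedStringCache {
    private final Map<String, String> cache;

    BoundedStringCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached instance equal to value, caching value itself if none exists. */
    synchronized String canonicalize(String value) {
        String existing = cache.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Passing each freshly built attribute value through `canonicalize` before storing it means equal values resolve to one shared String object while the cache holds them, and the cap keeps rarely reused values from accumulating.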



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
