[ https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483946#comment-16483946 ]
ASF GitHub Bot commented on NIFI-5225:
--------------------------------------

GitHub user FrederikP opened a pull request:

    https://github.com/apache/nifi/pull/2732

    NIFI-5225: Purge event data from event repository when Connectable is removed

    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    - [x] Is your initial contribution a single, squashed commit?

    ### For code changes:
    - [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder? _Clean install ran through just fine, but contrib-check complained about an unrelated package_
    - [x] Have you written or updated unit tests to verify your changes?
    - [ ] ~~If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?~~
    - [ ] ~~If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?~~
    - [ ] ~~If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?~~
    - [ ] ~~If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?~~

    ### For documentation related changes:
    - ~~[ ] Have you ensured that format looks appropriate for the output in which it is rendered?~~

    I introduced the option to purge data from the FlowFileEventRepository (the 5 min ring buffer) to fix this: https://issues.apache.org/jira/browse/NIFI-5225

    And it works for our setup.
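
For readers skimming this thread, the purge idea can be illustrated with a minimal sketch. This is not the code from the pull request and not NiFi's actual API: the class `EventRepositorySketch` and its methods (`recordEvent`, `reportTotalEvents`, `purgeEventsFor`, `trackedComponents`) are hypothetical names chosen for illustration, and a plain counter stands in for the per-component ring buffer. It only shows why a map keyed by component ID grows without bound unless entries are removed together with their components.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-component event store; not NiFi's actual class.
class EventRepositorySketch {
    // One entry per component ID. A simple counter replaces the real
    // per-component ring buffer to keep the sketch short.
    private final Map<String, AtomicLong> eventsByComponent = new ConcurrentHashMap<>();

    // Record an event for a component, creating its entry on first use.
    void recordEvent(String componentId) {
        eventsByComponent.computeIfAbsent(componentId, id -> new AtomicLong()).incrementAndGet();
    }

    // Reporting visits every entry, so its cost grows with the size of the map.
    long reportTotalEvents() {
        long total = 0;
        for (AtomicLong count : eventsByComponent.values()) {
            total += count.get();
        }
        return total;
    }

    // The fix idea: drop the entry when its component is removed from the flow.
    void purgeEventsFor(String componentId) {
        eventsByComponent.remove(componentId);
    }

    // Number of components that still have an entry in the map.
    int trackedComponents() {
        return eventsByComponent.size();
    }
}
```

In this sketch, whatever code handles the removal of a processor, connection, or process group would call `purgeEventsFor` with that component's ID, so that later reports no longer iterate over entries for components that no longer exist.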
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/FrederikP/nifi master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2732.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2732

----
commit 4e5a118305c9513cca239c136c48239c501e9907
Author: Frederik Petersen <fp@...>
Date:   2018-05-22T10:55:59Z

    NIFI-5225: Purge event data from event repository when Connectable is removed

----

> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
>                 Key: NIFI-5225
>                 URL: https://issues.apache.org/jira/browse/NIFI-5225
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.5.0, 1.6.0
>         Environment: HDF-3.1.0.0
>            Reporter: Frederik Petersen
>            Priority: Major
>              Labels: performance
>
> We use NiFi's API to change a part of our flow quite frequently. Over the
> past weeks we have noticed that the performance of web requests degrades over
> time and had a very hard time to find out why.
> Today I took a closer look. When using visualvm to sample cpu it already
> stood out that the longer the cluster was running, the more time was spent in
> 'SecondPrecisionEventContainer.generateReport()' during web requests. This
> method is already relied on a lot right after starting the cluster (for big
> flows and process groups). But the time spent in it increases (in our setup)
> the longer the cluster runs. This increases latency of almost every web
> request. Our flow reconfiguration script (calling many NiFi API endpoints)
> went from 2 minutes to 20 minutes run time in a few days.
> Looking at the source code I couldn't quite figure out why the run time
> should increase over time, because the ring buffers always stay the same size
> (301 entries|5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than
> there should have been. So I took a heap dump and ran a MemoryAnalyzer tool.
> The "Leak Suspects" overview gave me the final hint as to what was wrong.
> It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by
> "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance
> is referenced by
> org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @
> 0x7f86c50cda40 , loaded by "org.apache.nifi.nar.NarClassLoader @
> 0x7f86a0000000". The memory is accumulated in one instance of
> "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class
> loader>".
> The issue is:
> When we remove processors, connections, process groups from the flow, their
> data is not removed from the ConcurrentHashMap in RingBufferEventRepository.
> There is a 'purgeTransferEvents' but it only calls an empty 'purgeEvents'
> method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bounds and every time
> 'reportTransferEvents' is called it iterates over all (meaning more and more
> over time) entries of the map. This increases the latency of every web
> request and also causes a huge amount of memory to be occupied.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group,
> connection, ?...) using their onRemoved methods in the FlowController.
> This should stop the map from growing infinitely for any flow where
> components are removed frequently, especially when this is automated.
> Since this is quite urgent for us, I'll try to work on a fix for this and
> provide a pull request if successful.
> Since no-one noticed this before, I guess we are not the typical user of
> NiFi, as we thought it was possible to heavily reconfigure flows using the
> API, but with this performance issue, it's not.
> Please let me know if I can provide any more helpful detail for this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)