[
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483964#comment-16483964
]
Joseph Witt commented on NIFI-5225:
-----------------------------------
[~FrederikP] this is extremely impressive! Did you verify that this addressed
your case successfully? I talked with Mark Payne offline; he agreed there was
a problem here, and your update makes a ton of sense!
> Leak in RingBufferEventRepository for frequently updated flows
> --------------------------------------------------------------
>
> Key: NIFI-5225
> URL: https://issues.apache.org/jira/browse/NIFI-5225
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.5.0, 1.6.0
> Environment: HDF-3.1.0.0
> Reporter: Frederik Petersen
> Priority: Major
> Labels: performance
>
> We use NiFi's API to change a part of our flow quite frequently. Over the
> past weeks we noticed that the performance of web requests degrades over
> time, and we had a very hard time finding out why.
> Today I took a closer look. When I used VisualVM to sample CPU, it already
> stood out that the longer the cluster had been running, the more time was
> spent in 'SecondPrecisionEventContainer.generateReport()' during web
> requests. This method is relied on heavily right after the cluster starts
> (for big flows and process groups), but the time spent in it increases (in
> our setup) the longer the cluster runs. This adds latency to almost every
> web request: our flow reconfiguration script (which calls many NiFi API
> endpoints) went from 2 minutes to 20 minutes of run time in a few days.
> Looking at the source code, I couldn't figure out why the run time should
> increase over time, because the ring buffers always stay the same size
> (301 entries, covering 5 minutes at one-second precision).
> When sampling memory, I noticed far more EventSum instances than there
> should have been, so I took a heap dump and ran it through the Memory
> Analyzer tool. The "Leak Suspects" overview gave me the final hint as to
> what was wrong.
> It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by
> "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance
> is referenced by
> org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @
> 0x7f86c50cda40 , loaded by "org.apache.nifi.nar.NarClassLoader @
> 0x7f86a0000000". The memory is accumulated in one instance of
> "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class
> loader>".
> The issue is: when we remove processors, connections, or process groups
> from the flow, their data is not removed from the ConcurrentHashMap in
> RingBufferEventRepository. There is a 'purgeTransferEvents' method, but it
> only calls an empty 'purgeEvents' method on every
> 'SecondPrecisionEventContainer' in the map.
> This means the map grows without bound, and every call to
> 'reportTransferEvents' iterates over all of its entries (more and more over
> time). This increases the latency of every web request and also occupies a
> huge amount of memory.
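> To illustrate the pattern, here is a minimal, self-contained sketch; the
> names are simplified assumptions based on the behaviour described above,
> not the actual NiFi source:
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
>
> class LeakSketch {
>     // Stand-in for SecondPrecisionEventContainer: a fixed-size ring buffer,
>     // so per-component memory is bounded no matter how long NiFi runs.
>     static class EventContainer {
>         final long[] counts = new long[301]; // 5 min at one-second precision
>         void addEvent(long timestampMillis) {
>             counts[(int) ((timestampMillis / 1000) % 301)]++;
>         }
>         void purgeEvents(long cutoff) {
>             // intentionally empty: old slots are simply overwritten
>         }
>     }
>
>     private final ConcurrentHashMap<String, EventContainer> map =
>             new ConcurrentHashMap<>();
>
>     void addEvent(String componentId, long timestampMillis) {
>         map.computeIfAbsent(componentId, id -> new EventContainer())
>            .addEvent(timestampMillis);
>     }
>
>     // Iterates over every entry ever created. Nothing calls map.remove(...),
>     // so deleted components keep their containers forever: memory use and
>     // the cost of this loop both grow without bound on flows that are
>     // reconfigured frequently.
>     void purgeTransferEvents(long cutoff) {
>         map.values().forEach(c -> c.purgeEvents(cutoff));
>     }
> }
> {code}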
> A rough idea to fix this: remove the entry for each removed component
> (processor, process group, connection, ?...) using their onRemoved methods
> in the FlowController, as sketched below.
> This should stop the map from growing indefinitely for any flow where
> components are removed frequently, especially when the removals are
> automated.
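> A minimal sketch of that idea; the method signature and the wiring are
> assumptions for illustration, not the final patch:
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
>
> class FixedRepositorySketch {
>     private final ConcurrentHashMap<String, Object> map =
>             new ConcurrentHashMap<>();
>
>     // Called from the FlowController's onRemoved hooks whenever a
>     // processor, connection, or process group is deleted from the flow.
>     void purgeTransferEvents(String componentId) {
>         map.remove(componentId); // drop the removed component's containers
>     }
> }
> {code}
> With that in place, the map only holds entries for components that currently
> exist, so both its memory footprint and the cost of iterating it stay
> proportional to the size of the live flow.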
> Since this is quite urgent for us, I'll try to work on a fix and provide a
> pull request if I'm successful.
> Since no one has noticed this before, I guess we are not typical NiFi
> users: we assumed it was feasible to heavily reconfigure flows through the
> API, but with this performance issue it is not.
> Please let me know if I can provide any more detail that would help with
> this problem.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)