[ 
https://issues.apache.org/jira/browse/NIFI-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941078#comment-15941078
 ] 

Mark Payne commented on NIFI-3257:
----------------------------------

Using the -XX:+PrintGC JVM option and YourKit to profile garbage collection, 
along with significant DEBUG logging that I've added, I am seeing that the 
problem is largely due to excessive GC runs. I'm seeing up to 25% of my JVM 
time spent performing garbage collection. I've marked this ticket as being 
related to NIFI-3636 and NIFI-3648 because those tickets are intended to 
address the heavy garbage collection.
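
For anyone who wants a rough version of this number without a profiler, the 
standard java.lang.management beans expose cumulative GC time. The sketch 
below is only an illustration, not the measurement method used above: the 
class and method names (GcOverhead, gcTimeFraction) are made up, and the 
result is an approximation, since concurrent collectors can report time that 
overlaps with application work.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of NiFi: estimates the share of JVM uptime
// that the collectors report as spent in garbage collection.
final class GcOverhead {

    static double gcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // returns -1 if the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        // Guard against a zero uptime right after JVM start.
        return uptimeMillis > 0 ? (double) gcMillis / uptimeMillis : 0.0;
    }

    public static void main(String[] args) {
        System.out.printf("Approximate GC share of uptime: %.1f%%%n", gcTimeFraction() * 100.0);
    }
}
```

Logging this value periodically on each node would give a coarse view of 
whether GC time is approaching the level described above.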

I've also found that while we use multiple threads to replicate REST API calls, 
we do not read or merge node responses in parallel. This is done serially after 
all "Response" objects have been obtained. This is very inefficient and can 
result in very long request replication times. It could even result in one slow 
node causing other nodes' responses to time out: if Node 1 is slow to respond 
(due to GC or whatever), then the responses from Nodes 4, 5, and 6, for 
instance, could time out, and those nodes could be kicked out of the cluster 
because Node 1 was slow.
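
To make the difference concrete, here is a minimal sketch of reading node 
responses in parallel instead of one after another. It is not the NiFi 
replication code: ParallelResponseReader and readAll are hypothetical names, 
and each node's "read the response body" step is modeled as a plain 
Callable<String>. The point is only that a slow read from one node no longer 
delays the start of the reads from the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not NiFi's implementation.
final class ParallelResponseReader {

    // Starts every read on the pool at once, then collects the results in
    // the original node order. Each Callable stands in for reading one
    // node's response body (an assumption made for illustration).
    static List<String> readAll(List<Callable<String>> readers, ExecutorService pool) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (Callable<String> reader : readers) {
            futures.add(CompletableFuture.supplyAsync(() -> {
                try {
                    return reader.call();
                } catch (Exception e) {
                    throw new CompletionException(e);
                }
            }, pool));
        }
        List<String> bodies = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            bodies.add(future.join()); // waits only for reads that are still running
        }
        return bodies;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<Callable<String>> readers = new ArrayList<>();
            readers.add(() -> { Thread.sleep(200); return "node1"; }); // simulated slow node
            readers.add(() -> "node2");
            readers.add(() -> "node3");
            System.out.println(readAll(readers, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a serial loop, the total read time is the sum of all per-node read 
times; with the approach sketched here it is closer to the slowest single 
read, provided the pool has enough threads.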

> Cluster stability issues during high throughput
> -----------------------------------------------
>
>                 Key: NIFI-3257
>                 URL: https://issues.apache.org/jira/browse/NIFI-3257
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.0.0, 1.1.0, 1.1.1, 1.0.1
>            Reporter: Jeff Storck
>
> During high throughput of data in a cluster (135MB/s), nodes experience 
> frequent disconnects (every few minutes) and role switching (Primary and 
> Cluster Coordinator).  This makes API requests difficult since the requests 
> cannot be replicated to all nodes while reconnecting.  The cluster can 
> recover for a time (as mentioned above, for a few minutes) before going 
> through another round of disconnects and role switching.
> The cluster is able to continue to process data during these connection and 
> role-switching issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
