[
https://issues.apache.org/jira/browse/NIFI-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898223#comment-16898223
]
Mark Payne commented on NIFI-6517:
----------------------------------
This appears to be easy to replicate. Create a GenerateFlowFile processor and
connect it to UpdateAttribute. Load balance the connection using Round Robin. For
GenerateFlowFile, set the batch size to 100 FlowFiles, and set the content size
large enough that transferring data between nodes is not nearly instantaneous
(for example, 250 KB). Start the processors. Wait a few seconds for data to queue
up and start being distributed between the nodes. Use `kill -9` to kill one of
the NiFi processes. After about 40 seconds the cluster coordinator will determine
that the node is not sending heartbeats and will disconnect it. At this point, we
will see the above stack trace.
Note that this does not always happen; in my testing it occurred about 1 time
in 3.
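The ERROR line below (Unacknowledged going from 0 to -945) together with the `onTransactionFailed` -> `acknowledge` frames suggests the same FlowFiles being acknowledged more times than they were polled. As a rough illustration of that failure mode, here is a minimal sketch; this is not NiFi's actual code, and the class and method names are invented for the example:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only (not SwappablePriorityQueue itself): a counter of
// "unacknowledged" FlowFiles that throws, as in the log, when it goes negative.
class UnacknowledgedCounterSketch {
    private final AtomicInteger unacknowledged = new AtomicInteger(0);

    // Called when FlowFiles are pulled from the queue for transfer to another node.
    void poll(int count) {
        unacknowledged.addAndGet(count);
    }

    // Called when the transfer of those FlowFiles completes or fails.
    void acknowledge(int count) {
        int updated = unacknowledged.addAndGet(-count);
        if (updated < 0) {
            // Mirrors the "Cannot create negative queue size" guard in the stack trace.
            throw new RuntimeException("Cannot create negative queue size");
        }
    }

    public static void main(String[] args) {
        UnacknowledgedCounterSketch queue = new UnacknowledgedCounterSketch();
        queue.poll(945);
        queue.acknowledge(945); // normal acknowledgement: counter returns to 0
        try {
            // A duplicate acknowledgement (e.g. normal path plus a
            // transaction-failure callback) would drive the counter to -945.
            queue.acknowledge(945);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints: Cannot create negative queue size
        }
    }
}
```

If the node-disconnected path and the regular completion path can both acknowledge the same batch, a race of this shape would explain why the failure is intermittent.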
> Load Balanced Connections can show counts that are inaccurate, resulting in
> data not moving through connection
> --------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-6517
> URL: https://issues.apache.org/jira/browse/NIFI-6517
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Critical
> Fix For: 1.10.0
>
>
> I've encountered an issue where data that is load balanced using the Round
> Robin strategy shows up in the queue but cannot be processed by the follow-on
> processor. List Queue indicates no FlowFiles, and Empty Queue indicates no
> FlowFiles.
> An error in the logs indicates that there is a bug in maintaining the proper
> size of the FlowFile Queue:
> {code:java}
> 2019-08-01 11:39:08,422 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 2 heartbeats in 32480 nanos
> 2019-08-01 11:39:08,422 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator localhost:8482 requested disconnection from cluster due to Have not received a heartbeat from node in 40 seconds
> 2019-08-01 11:39:08,422 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of localhost:8482 changed from NodeConnectionStatus[nodeId=localhost:8482, state=CONNECTED, updateId=30] to NodeConnectionStatus[nodeId=localhost:8482, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 40 seconds, updateId=31]
> 2019-08-01 11:39:08,441 ERROR [Load-Balanced Client Thread-2] o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from FlowFile Queue Size[ ActiveQueue=[500, 2560000 Bytes], Swap Queue=[4845, 24806400 Bytes], Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[500, 2560000 Bytes], Swap Queue=[4845, 24806400 Bytes], Swap Files=[0], Unacknowledged=[-945, -4838400 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> at org.apache.nifi.controller.queue.SwappablePriorityQueue.logIfNegative(SwappablePriorityQueue.java:945)
> at org.apache.nifi.controller.queue.SwappablePriorityQueue.incrementUnacknowledgedQueueSize(SwappablePriorityQueue.java:935)
> at org.apache.nifi.controller.queue.SwappablePriorityQueue.acknowledge(SwappablePriorityQueue.java:426)
> at org.apache.nifi.controller.queue.clustered.partition.RemoteQueuePartition$1.onTransactionFailed(RemoteQueuePartition.java:160)
> at org.apache.nifi.controller.queue.clustered.client.async.TransactionFailureCallback.onTransactionFailed(TransactionFailureCallback.java:26)
> at org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient.nodeDisconnected(NioAsyncLoadBalanceClient.java:295)
> at org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask.run(NioAsyncLoadBalanceClientTask.java:71)
> at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
> Note that this occurs immediately after the status of one of the other nodes
> in the cluster changes to DISCONNECTED.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)