Mark Bean created NIFI-9433:
-------------------------------
Summary: Load Balancer hangs
Key: NIFI-9433
URL: https://issues.apache.org/jira/browse/NIFI-9433
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Affects Versions: 1.15.0
Reporter: Mark Bean
Simplified scenario to demonstrate problem:
A 2-node cluster with a simple flow: GenerateFlowFile -> load-balanced
connection -> UpdateAttribute. Separately, and unconnected to those two
processors: Funnel #1 -> non-load-balanced Connection -> Funnel #2.
GenerateFlowFile is scheduled to run on the Primary Node only. Once it is
started, the connection becomes very busy load balancing (round robin). Then,
the connection between the two funnels is removed.
Immediately, an error is thrown, and the flow gets stuck in a state of
constantly throwing errors indicating that a connection (the one just deleted)
does not exist and cannot be balanced.
It is unclear why this connection is being considered by the load balancer at
all.
The sequence of errors includes the following:
Primary Node reports:
2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811]
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from
FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap
Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[
ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
Unacknowledged=[-206, -20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811]
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile
Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206,
20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206,
-20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
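When triaging a log for this symptom, it may help to pull the counters out of
the "FlowFile Queue Size[...]" text and flag any that have gone negative. The
following is a minimal sketch, not part of NiFi: the function name
negative_counters and the regular expression are assumptions based only on the
message format shown above.

```python
import re

# Matches counters such as "Unacknowledged=[-206, -20600 Bytes]".
# The pattern is inferred from the log text above, not from NiFi source.
COUNTER = re.compile(
    r"(ActiveQueue|Swap Queue|Unacknowledged)=\[(-?\d+), (-?\d+) Bytes\]"
)

def negative_counters(queue_size_text):
    """Return {counter name: (count, bytes)} for each negative counter."""
    # Collapse whitespace so that text wrapped by a mail client still matches.
    normalized = " ".join(queue_size_text.split())
    found = {}
    for name, count, size in COUNTER.findall(normalized):
        count, size = int(count), int(size)
        if count < 0 or size < 0:
            found[name] = (count, size)
    return found

sample = (
    "FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], "
    "Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ]"
)
print(negative_counters(sample))
```

For the sample above, only the Unacknowledged counter is reported, matching
the -206 FlowFiles and -20600 Bytes in the first error.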
The above may be a symptom of subsequent errors in the log:
Primary Node reports:
2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6]
o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer
{host:port}
java.io.IOException: Failed to negotiate Protocol Version with Peer
{host:port}. Recommended version 1 but instead of an ACCEPT or REJECT response
got back a response of 33.
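To illustrate the kind of check that produces a message like this, here is a
minimal sketch of a client that reads one response byte during version
negotiation and rejects anything that is not an accept or reject code. It is
not NiFi's implementation: negotiate_version and the two response code values
are placeholders chosen for illustration. Whether the unexpected byte in this
report comes from a stream that is out of step between the two nodes is an
assumption, not something the logs confirm.

```python
import io

# Placeholder response codes for illustration only; not taken from NiFi source.
VERSION_ACCEPTED = 0x10
VERSION_REJECTED = 0x11

def negotiate_version(stream, recommended_version):
    """Read one response byte and require a negotiation response."""
    data = stream.read(1)
    if not data:
        raise EOFError("peer closed the connection during negotiation")
    response = data[0]
    if response == VERSION_ACCEPTED:
        return recommended_version
    if response == VERSION_REJECTED:
        raise IOError("peer rejected version %d" % recommended_version)
    # Any other byte means the peer is not answering the negotiation request.
    raise IOError(
        "Recommended version %d but instead of an ACCEPT or REJECT response "
        "got back a response of %d" % (recommended_version, response)
    )

# A peer that answers the negotiation as expected.
print(negotiate_version(io.BytesIO(bytes([VERSION_ACCEPTED])), 1))

# A peer whose next byte is 33, as in the error above.
try:
    negotiate_version(io.BytesIO(bytes([33])), 1)
except IOError as error:
    print(error)
```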
Non-Primary Node reports:
2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4]
o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer
{fqdn/IP:port}
java.io.IOException: Expected to receive Transaction Completion Indicator from
Peer {fqdn} but instead received a value of 1
The most concerning part is the following error, which indicates that a
Connection not configured to load balance was attempting to receive a FlowFile.
Non-Primary Node reports:
2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808]
o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from
Peer {fqdn} for Connection with ID {uuid} but no connection exists with that ID.
Note that the {uuid} value in this message corresponds to the Connection that
was removed, which caused the errors to begin. Should the above message ever
occur? Does the load balancer ever consider Connections which are configured as
"Do not load balance"?
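To confirm that every such error refers to the removed Connection, the IDs can
be tallied from the log. This is a minimal sketch under assumptions: the
function name missing_connection_ids is hypothetical, the pattern is taken from
the message text quoted above, and the UUID in the sample is made up.

```python
import re
from collections import Counter

# Pattern inferred from the StandardLoadBalanceProtocol message quoted above.
MISSING = re.compile(
    r"for Connection with ID (\S+) but no connection exists with that ID"
)

def missing_connection_ids(log_lines):
    """Count how often each Connection ID appears in the error."""
    counts = Counter()
    for line in log_lines:
        match = MISSING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The UUID and peer name below are invented placeholders, not from the report.
sample_lines = [
    "ERROR o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive "
    "FlowFiles from Peer node2 for Connection with ID "
    "11111111-2222-3333-4444-555555555555 but no connection exists with that ID.",
    "INFO unrelated line",
]
print(missing_connection_ids(sample_lines))
```

If more than one ID shows up in the tally, that would suggest the problem is
not limited to the Connection that was deleted.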
Users have also reported that FlowFiles have been load balanced from one
Connection to another, unrelated Connection on the other Node. (This is still
being verified.)
Finally, the UI shows the load-balanced connection as actively load balancing
some number (206 in this case) of FlowFiles currently in the connection, yet
attempts to "list queue" on this connection show no FlowFiles. Presumably they
are being held by the load balancer and are inaccessible in the queue.