[jira] [Updated] (NIFI-9433) Load Balancer hangs

Mark Bean (Jira) Thu, 02 Dec 2021 06:12:33 -0800


     [ 
https://issues.apache.org/jira/browse/NIFI-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mark Bean updated NIFI-9433:
----------------------------
    Description: 
Simplified scenario to demonstrate problem:
A 2-node cluster with a simple flow. GenerateFlowFile -> load-balanced 
connection -> UpdateAttribute. And, unconnected to the first two processors, 
Funnel #1 -> non-load-balanced Connection -> Funnel #2.
GenerateFlowFile is scheduled to run on Primary Node only. It is started. This 
causes the connection to be very busy load balancing (round robin). Then, the 
connection between the two funnels is removed.
Immediately, an error is thrown, and the flow gets stuck in a state of 
constantly throwing errors indicating that a connection (the one just deleted) 
does not exist and cannot be balanced.
It is unclear why this connection is being considered by the load balancer at 
all.

The sequence of errors includes the following:
Primary Node reports:
2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from 
FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap 
Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ 
ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile 
Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206, 
20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, 
-20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
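
To make the negative counts in the two messages above easier to spot in a longer log, the queue-size text can be parsed mechanically. The following is only a sketch: the helper names (parse_queue_sizes, negative_fields) are hypothetical, the pattern is derived solely from the message text quoted above, and it is not taken from NiFi's code.

```python
import re

# Pattern for the body of one "FlowFile Queue Size[ ... ]" group, based only
# on the log text quoted in this report (an assumption about the format).
SIZE_RE = re.compile(
    r"ActiveQueue=\[(-?\d+), (-?\d+) Bytes\], "
    r"Swap Queue=\[(-?\d+), (-?\d+) Bytes\], "
    r"Swap Files=\[(-?\d+)\], "
    r"Unacknowledged=\[(-?\d+), (-?\d+) Bytes\]"
)

def parse_queue_sizes(text):
    """Return one dict per queue-size group found in the text."""
    flat = " ".join(text.split())  # undo line wrapping from the email
    sizes = []
    for m in SIZE_RE.finditer(flat):
        a_n, a_b, s_n, s_b, files, u_n, u_b = map(int, m.groups())
        sizes.append({
            "active": (a_n, a_b),
            "swap": (s_n, s_b),
            "swap_files": files,
            "unacknowledged": (u_n, u_b),
        })
    return sizes

def negative_fields(size):
    """Names of the (count, bytes) pairs that contain a negative value."""
    return [k for k in ("active", "swap", "unacknowledged") if min(size[k]) < 0]

# First message from the report, wrapped as it appears in the email.
SAMPLE = """
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from
FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap
Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[
ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
Unacknowledged=[-206, -20600 Bytes] ]
"""

before, after = parse_queue_sizes(SAMPLE)
print(negative_fields(before), negative_fields(after))
```

Applied to the first message, this reports no negative fields before the update and a negative Unacknowledged pair after it. The figures also work out to 100 bytes per FlowFile (20600 / 206).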

The above may be a symptom of subsequent errors in the log:
Primary Node reports:
2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] 
o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer 
<host:port>
java.io.IOException: Failed to negotiate Protocol Version with Peer 
<host:port>. Recommended version 1 but instead of an ACCEPT or REJECT response 
got back a response of 33.

Non-Primary Node reports:
2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] 
o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with 
Peer <fqdn/IP:port>
java.io.IOException: Expected to receive Transaction Completion Indicator from 
Peer <fqdn> but instead received a value of 1

The most concerning part is the following error, which indicates that a 
Connection not configured to load balance was attempting to receive a FlowFile.
Non-Primary Node reports:
2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] 
o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from 
Peer <fqdn> for Connection with ID <uuid> but no connection exists with that ID.
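
Because the errors above come from two nodes and several threads, it can help to merge both logs into one timeline. The sketch below assumes each entry sits on a single line in the form "timestamp LEVEL [thread] logger message", as the excerpts in this report suggest. The names parse_line and SAMPLE_LINES are hypothetical, and the sample messages are shortened copies of the excerpts.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed single-line layout: "timestamp LEVEL [thread] logger message".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<thread>[^\]]+)\] "
    r"(?P<logger>\S+) (?P<message>.*)$"
)

def parse_line(line):
    """Parse one log line; return None for lines that do not match
    (for example, stack-trace lines such as 'java.io.IOException: ...')."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return entry

# Shortened, unwrapped copies of three entries quoted in this report.
SAMPLE_LINES = [
    "2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] "
    "o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles",
    "2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] "
    "o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer",
    "2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] "
    "o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer",
]

# Merge and order the entries, then count ERROR entries per logger.
entries = sorted(
    (e for e in map(parse_line, SAMPLE_LINES) if e is not None),
    key=lambda e: e["ts"],
)
errors_by_logger = Counter(e["logger"] for e in entries if e["level"] == "ERROR")

for e in entries:
    print(e["ts"].isoformat(sep=" ", timespec="milliseconds"), e["logger"])
```

Sorting by timestamp places the client-side negotiation failure first and the "no connection exists" error last, matching the order described in this report.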

Note that the <uuid> value in this message corresponds to the Connection that 
was removed, which caused the errors to begin. Should the above message ever 
occur? Does the load balancer ever consider Connections which are configured 
as "Do not load balance"?

Users have also reported that FlowFiles have been load balanced from one 
Connection to another, unrelated Connection on the other Node. (This is still 
being verified.)

Finally, in the UI the load-balanced connection indicates that it is actively 
load balancing some number (206 in this case) of the FlowFiles currently in 
the connection, yet attempts to "list queue" on this connection show no 
FlowFiles. Presumably they are being held by the load balancer and are 
inaccessible in the queue.

  was:
Simplified scenario to demonstrate problem:
A 2-node cluster with a simple flow. GenerateFlowFile -> load-balanced 
connection -> UpdateAttribute. And, unconnected to the first two processors, 
Funnel #1 -> non-load-balanced Connection -> Funnel #2.
GenerateFlowFile is scheduled to run on Primary Node only. It is started. This 
causes the connection to be very busy load balancing (round robin). Then, the 
connection between the two funnels is removed.
Immediately, an error is thrown, and the flow gets stuck in a state of 
constantly throwing errors indicating that a connection (the one just deleted) 
does not exist and cannot be balanced.
It is unclear why this connection is being considered by the load balancer at 
all.

The sequence of errors includes the following:
Primary Node reports:
2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from 
FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap 
Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ 
ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile 
Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206, 
20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, 
-20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size

The above may be a symptom of subsequent errors in the log:
Primary Node reports:
2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] 
o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer 
{host:port}
java.io.IOException: Failed to negotiate Protocol Version with Peer 
{host:port}. Recommended version 1 but instead of an ACCEPT or REJECT response 
got back a response of 33.

Non-Primary Node reports:
2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] 
o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer 
{fqdn/IP:port}
java.io.IOException: Expected to receive Transaction Completion Indicator from 
Peer {fqdn} but instead received a value of 1

The highly concerning part is this error which indicates a Connection which was 
not scheduled to load balance was attempting to receive a FlowFile.
Non-Primary Node reports:
2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] 
o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from 
Peer {fqdn} for Connection with ID {uuid} but no connection exists with that ID.

Note that the {uuid} value in this message corresponds to the Connection that 
was removed, which caused the errors to begin. Should the above message ever 
occur? Does the load balancer ever consider Connections which are configured 
as "Do not load balance"?

Users have also reported that FlowFiles have been load balanced from one 
Connection to another, unrelated Connection on the other Node. (This is still 
being verified.)

Finally, on the UI the load-balanced connection indicates it is actively load 
balancing some number (206 in this case) of FlowFiles currently in the 
connection. And, attempts to "list queue" on this connection show no FlowFiles. 
Presumably they are being held by the load balancer and are inaccessible in the 
queue.




> Load Balancer hangs
> -------------------
>
>                 Key: NIFI-9433
>                 URL: https://issues.apache.org/jira/browse/NIFI-9433
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.15.0
>            Reporter: Mark Bean
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
