[ https://issues.apache.org/jira/browse/NIFI-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joe Witt updated NIFI-9433:
---------------------------
Fix Version/s: 1.15.1
> Load Balancer hangs
> -------------------
>
> Key: NIFI-9433
> URL: https://issues.apache.org/jira/browse/NIFI-9433
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.15.0
> Reporter: Mark Bean
> Assignee: Mark Payne
> Priority: Critical
> Labels: connections, load-balanced-connections, load-balancing
> Fix For: 1.16.0, 1.15.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Simplified scenario to demonstrate problem:
> A 2-node cluster with a simple flow. GenerateFlowFile -> load-balanced
> connection -> UpdateAttribute. And, unconnected to the first two processors,
> Funnel #1 -> non-load-balanced Connection -> Funnel #2.
> GenerateFlowFile is scheduled to run on Primary Node only. It is started.
> This causes the connection to be very busy load balancing (round robin).
> Then, the connection between the two funnels is removed.
> Immediately, an error is thrown, and the flow gets stuck in a state of
> constantly throwing errors indicating that a connection (the one just
> deleted) does not exist and cannot be balanced.
> It is unclear why this connection is being considered by the load balancer at
> all.
> The sequence of errors includes the following:
> Primary Node reports
> 2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811]
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged
> from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes],
> Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[
> ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> 2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811]
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from
> FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap
> Files=[0], Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[
> ActiveQueue=[206, 20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> The above may be a symptom of subsequent errors in the log:
> Primary Node reports:
> 2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6]
> o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer
> <host:port>
> java.io.IOException: Failed to negotiate Protocol Version with Peer
> <host:port>. Recommended version 1 but instead of an ACCEPT or REJECT
> response got back a response of 33.
> Non-Primary Node reports:
> 2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4]
> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with
> Peer<fqdn/IP:port>
> java.io.IOException: Expected to receive Transaction Completion Indicator
> from Peer <fqdn> but instead received a value of 1
> The most concerning part is this error, which indicates that a Connection
> that was not configured to load balance was attempting to receive a FlowFile.
> Non-Primary Node reports:
> 2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808]
> o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from
> Peer <fqdn> for Connection with ID <uuid> but no connection exists with that
> ID.
> Note that the <uuid> value in this message corresponds to the Connection that
> was removed, causing the errors to begin. Should the above message ever occur?
> Does the load balancer ever consider Connections which are configured as "Do
> not load balance"?
> Users have also reported that FlowFiles have been load balanced from one
> Connection to another, unrelated Connection on the other Node. (This is still
> being verified.)
> Finally, in the UI the load-balanced connection indicates it is actively load
> balancing some number (206 in this case) of FlowFiles currently in the
> connection, yet attempts to "list queue" on this connection show no
> FlowFiles. Presumably they are being held by the load balancer and are
> inaccessible in the queue.
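
As a reading aid for the two SwappablePriorityQueue log entries quoted above, the following Python sketch replays the size arithmetic they show. It is not NiFi code: the QueueSize class and both helper functions are hypothetical names chosen for illustration. The assumption that 206 FlowFiles of 100 bytes each were released from Unacknowledged and then added to the active queue, without ever having been counted as unacknowledged, is inferred only from the logged numbers and may not match the actual cause.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueueSize:
    """Hypothetical model of the counters printed in the log entries."""
    active_count: int = 0
    active_bytes: int = 0
    unack_count: int = 0
    unack_bytes: int = 0

    def is_negative(self) -> bool:
        # True if any counter has dropped below zero.
        return min(self.active_count, self.active_bytes,
                   self.unack_count, self.unack_bytes) < 0


def release_unacknowledged(size: QueueSize, count: int, nbytes: int) -> QueueSize:
    """Subtract FlowFiles from the unacknowledged counters (assumed step 1)."""
    return replace(size,
                   unack_count=size.unack_count - count,
                   unack_bytes=size.unack_bytes - nbytes)


def add_active(size: QueueSize, count: int, nbytes: int) -> QueueSize:
    """Add FlowFiles to the active counters (assumed step 2)."""
    return replace(size,
                   active_count=size.active_count + count,
                   active_bytes=size.active_bytes + nbytes)


# Start from the all-zero size shown in the first log entry.
start = QueueSize()

# 206 FlowFiles, 20600 bytes in total, as reported in the log.
after_release = release_unacknowledged(start, 206, 20600)
after_requeue = add_active(after_release, 206, 20600)

print(after_release)
print(after_requeue)
print("negative size:", after_requeue.is_negative())
```

Under these assumptions, both resulting states carry Unacknowledged=[-206, -20600 Bytes], which is the condition the "Cannot create negative queue size" exceptions complain about, and the active count ends at 206, the same number the description mentions for the UI.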
--
This message was sent by Atlassian Jira
(v8.20.1#820001)