Mark Payne created NIFI-8196:
--------------------------------
Summary: When a node is disconnected due to failing to service a
request, upon cluster reconnection it may not participate in leader election
Key: NIFI-8196
URL: https://issues.apache.org/jira/browse/NIFI-8196
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Reporter: Mark Payne
Assignee: Mark Payne
NIFI-7920 fixed a bug that can result in nodes getting the wrong Revision for
some components. The fix for that, however, appears to have caused a
regression. When a Node is disconnected due to failing to service a replicated
API request, such as a component being stopped/started/moved, it will now
unregister from leader election for Primary Node / Cluster Coordinator.
However, if it then reconnects, it does not re-register for those roles. As a
result, a node can disconnect, reconnect, and then never be able to become
Cluster Coordinator. If this happens to every node in a cluster, no node
remains eligible to become Cluster Coordinator. This results in logs such as:
{code:java}
2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: java.lang.IllegalArgumentException: Cannot send heartbeat to address []. Address must be in <hostname>:<port> format
{code}
And errors in the UI stating:
{code:java}
Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected.. Returning Service Unavailable response.
{code}
At this point, no Cluster Coordinator will be elected until the nodes are
restarted.
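
The failure mode described in this issue can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyElection, ToyNode, ToyClusterDemo, and the reRegisterOnReconnect flag are hypothetical names invented for illustration and are not NiFi classes or APIs. The "first registered candidate wins" rule is likewise a simplification, not how NiFi actually elects a coordinator. The model only shows why a missing re-registration on reconnect can leave the cluster with no eligible candidate.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Toy model only: hypothetical classes, not NiFi's leader election API.
final class ToyElection {
    // Nodes currently registered as candidates for the coordinator role.
    private final Set<String> candidates = new LinkedHashSet<>();

    void register(String nodeId) { candidates.add(nodeId); }

    void unregister(String nodeId) { candidates.remove(nodeId); }

    // Simplified rule: the earliest registered candidate wins.
    // Empty when no node is registered.
    Optional<String> electedCoordinator() {
        return candidates.stream().findFirst();
    }
}

final class ToyNode {
    private final String id;
    private final ToyElection election;
    // Hypothetical switch: false models the regression, true the expected behavior.
    private final boolean reRegisterOnReconnect;

    ToyNode(String id, ToyElection election, boolean reRegisterOnReconnect) {
        this.id = id;
        this.election = election;
        this.reRegisterOnReconnect = reRegisterOnReconnect;
        election.register(id); // a node registers when it first joins
    }

    // Models a disconnect after failing to service a replicated request.
    void disconnect() { election.unregister(id); }

    // Models reconnection; the regression corresponds to skipping register().
    void reconnect() {
        if (reRegisterOnReconnect) {
            election.register(id);
        }
    }
}

final class ToyClusterDemo {
    // Every node disconnects and reconnects once; returns who can be elected afterwards.
    static Optional<String> coordinatorAfterFullCycle(boolean reRegisterOnReconnect) {
        ToyElection election = new ToyElection();
        ToyNode[] nodes = {
            new ToyNode("node-1", election, reRegisterOnReconnect),
            new ToyNode("node-2", election, reRegisterOnReconnect)
        };
        for (ToyNode node : nodes) {
            node.disconnect();
            node.reconnect();
        }
        return election.electedCoordinator();
    }

    public static void main(String[] args) {
        System.out.println("without re-registration: " + coordinatorAfterFullCycle(false));
        System.out.println("with re-registration:    " + coordinatorAfterFullCycle(true));
    }
}
```

In this model, once every node has gone through a disconnect and reconnect without re-registering, the candidate set is empty and stays empty, which mirrors the "no Cluster Coordinator elected" state above. With re-registration enabled, a candidate is always available after the cycle.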