Mark Payne created NIFI-8196:
--------------------------------

             Summary: When a node is disconnected due to failing to service a 
request, upon cluster reconnection it may not participate in leader election
                 Key: NIFI-8196
                 URL: https://issues.apache.org/jira/browse/NIFI-8196
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne


NIFI-7920 fixed a bug that could result in nodes getting the wrong Revision 
for some components. The fix for that, however, appears to have caused a 
regression. When a node is disconnected due to failing to service a replicated 
API request, such as a component being stopped/started/moved, it now 
unregisters from leader election for the Primary Node / Cluster Coordinator 
roles. However, if it then reconnects, it does not re-register for those 
roles. As a result, a node can disconnect, reconnect, and never again be able 
to become Cluster Coordinator. If this happens to all nodes in a cluster, no 
node remains eligible to become Cluster Coordinator. This results in logs 
such as:
{code:java}
2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
java.lang.IllegalArgumentException: Cannot send heartbeat to address []. 
Address must be in <hostname>:<port> format {code}
And errors in the UI stating:
{code:java}
Action cannot be performed because there is currently no Cluster Coordinator 
elected. The request should be tried again after a moment, after a Cluster 
Coordinator has been automatically elected.. Returning Service Unavailable 
response. {code}
At this point, no Cluster Coordinator will be elected until the nodes are 
restarted.
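
To make the sequence concrete, here is a minimal sketch of the lifecycle 
described above. It is a toy model and not NiFi code: the class 
ElectionSketch, the Node type, and the reRegisterOnReconnect flag are 
hypothetical names invented for illustration. It only shows why a node that 
unregisters on disconnect and does not re-register on reconnect stays 
ineligible for the Cluster Coordinator role.

```java
import java.util.HashSet;
import java.util.Set;

public class ElectionSketch {
    static final String COORDINATOR = "Cluster Coordinator";
    static final String PRIMARY = "Primary Node";

    /** Toy stand-in for a node's election registrations (hypothetical, not NiFi's API). */
    static class Node {
        private final Set<String> registeredRoles = new HashSet<>();
        private final boolean reRegisterOnReconnect;

        Node(boolean reRegisterOnReconnect) {
            this.reRegisterOnReconnect = reRegisterOnReconnect;
            registerForRoles(); // a node registers for both roles at startup
        }

        private void registerForRoles() {
            registeredRoles.add(COORDINATOR);
            registeredRoles.add(PRIMARY);
        }

        /** Disconnect caused by failing to service a replicated request: unregister from all roles. */
        void disconnectAfterFailedRequest() {
            registeredRoles.clear();
        }

        /** Reconnect to the cluster; roles come back only if re-registration happens here. */
        void reconnect() {
            if (reRegisterOnReconnect) {
                registerForRoles();
            }
        }

        boolean isEligibleFor(String role) {
            return registeredRoles.contains(role);
        }
    }

    public static void main(String[] args) {
        // Behavior described in this issue: no re-registration on reconnect.
        Node regressed = new Node(false);
        regressed.disconnectAfterFailedRequest();
        regressed.reconnect();
        System.out.println(regressed.isEligibleFor(COORDINATOR)); // prints false

        // Behavior one would expect: re-register on reconnect.
        Node expected = new Node(true);
        expected.disconnectAfterFailedRequest();
        expected.reconnect();
        System.out.println(expected.isEligibleFor(COORDINATOR)); // prints true
    }
}
```

If every node in the cluster follows the first path, no node is left 
registered for the Cluster Coordinator role, which matches the "no Cluster 
Coordinator elected" state reported above.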



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
