Mark Payne created NIFI-10362:
---------------------------------

             Summary: Cluster can disconnect node as soon as it rejoins cluster upon restart
                 Key: NIFI-10362
                 URL: https://issues.apache.org/jira/browse/NIFI-10362
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Mark Payne
            Assignee: Mark Payne


When the Cluster Coordinator disconnects a node because a user requested it, 
the node is immediately marked as DISCONNECTED, and a background thread is then 
responsible for notifying the node that it has been disconnected. The 
background task retries several times if it cannot successfully send the 
notification.

However, if the node is disconnected and then restarted before it has been 
notified, the node becomes CONNECTING (and possibly then CONNECTED) before the 
background task is triggered. The task then tells the node that it is 
DISCONNECTED, but the Cluster Coordinator no longer agrees (because it has 
already changed the state back to CONNECTING/CONNECTED).
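
The race above can be illustrated with a minimal sketch. This is not the NiFi 
implementation or the actual fix; the class, enum, and method names below are 
hypothetical. It only shows one possible guard: the delayed notification task 
re-checks the coordinator's current view of the node before telling it that it 
is disconnected.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the coordinator's bookkeeping; not NiFi code.
class DisconnectRaceSketch {
    enum NodeState { CONNECTING, CONNECTED, DISCONNECTED }

    // The coordinator's current view of each node, keyed by node address.
    private final Map<String, NodeState> states = new ConcurrentHashMap<>();

    void setState(String nodeId, NodeState state) {
        states.put(nodeId, state);
    }

    // Assumed guard: a delayed disconnect notification is only sent if the
    // coordinator still considers the node DISCONNECTED at send time.
    boolean shouldNotifyDisconnect(String nodeId) {
        return states.get(nodeId) == NodeState.DISCONNECTED;
    }

    public static void main(String[] args) {
        DisconnectRaceSketch coordinator = new DisconnectRaceSketch();
        String node = "localhost:5672";

        // A user requests the disconnect; the state changes immediately.
        coordinator.setState(node, NodeState.DISCONNECTED);
        System.out.println("notify now? " + coordinator.shouldNotifyDisconnect(node));

        // The node restarts and rejoins before the background task has run.
        coordinator.setState(node, NodeState.CONNECTING);
        System.out.println("notify after rejoin? " + coordinator.shouldNotifyDisconnect(node));
    }
}
```

A state-only check like this cannot tell apart two separate disconnects of the 
same node; a real guard would likely need to compare something like a state 
version as well.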

While the chances of this happening in production are slim and it is easily 
worked around (by simply waiting a few seconds after disconnecting a node 
before restarting it, or just restarting without disconnecting), it causes a 
lot of problems for system tests and potentially other automated activities.

It results in the following log message in the Cluster Coordinator:
{code:java}
2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672] org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to notify localhost:5672 that it has been disconnected from the cluster due to User anonymous requested that node be disconnected from cluster
{code}
And then we see confusing error messages such as:
{code:java}
2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23] org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator Received a status of 200 from localhost:5672 for request PUT /nifi-api/flow/process-groups/root when performing first stage of two-stage commit. The action will not occur. Node explanation: {"id":"root","state":"STOPPED"}
{code}
This is because when the cluster coordinator replicates the request to all 
nodes, the node that thinks it is disconnected receives the request and 
performs the action. It then responds with a "200 OK" but it should have noted 
that it's the first phase of a 2-phase action and responded with "201 Continue".
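
The coordinator-side check implied by the log message above can be sketched as 
follows. This is an illustration under stated assumptions, not NiFi's 
replication code: the class and method names are hypothetical, and the 
"continue" status is passed in as a parameter rather than hard-coded, since 
only the wording in this issue is available here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a first-phase check in a two-phase request; not NiFi code.
class TwoPhaseCheckSketch {

    // The second phase may proceed only if every node answered the first
    // (verification) phase with the expected "continue" status.
    static boolean allNodesAgree(List<Integer> firstPhaseStatuses, int continueStatus) {
        return !firstPhaseStatuses.isEmpty()
                && firstPhaseStatuses.stream().allMatch(s -> s == continueStatus);
    }

    public static void main(String[] args) {
        // Placeholder value taken from the wording of this issue; an assumption.
        int continueStatus = 201;

        // One node believes it is disconnected, performs the action, and returns 200.
        List<Integer> statuses = Arrays.asList(continueStatus, 200, continueStatus);
        System.out.println("proceed? " + allNodesAgree(statuses, continueStatus));
    }
}
```

With a single node replying 200, the check fails and the coordinator abandons 
the request, which matches the "The action will not occur" message even though 
that node has already applied the change locally.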



--
This message was sent by Atlassian Jira
(v8.20.10#820010)