Mark Payne created NIFI-10362:
---------------------------------
Summary: Cluster can disconnect node as soon as it rejoins cluster
upon restart
Key: NIFI-10362
URL: https://issues.apache.org/jira/browse/NIFI-10362
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Reporter: Mark Payne
Assignee: Mark Payne
When the Cluster Coordinator disconnects a node due to a user requesting that
the node get disconnected, the node is immediately marked as DISCONNECTED, and
then a background thread is responsible for notifying the node that it's been
disconnected. The background task attempts several times if it cannot
successfully send the notification.
However, if the node is disconnected and then restarted before it's been
notified, we have a situation in which the node becomes CONNECTING (and
possibly then CONNECTED), and then the background task is triggered. This then
results in the node being told that it's DISCONNECTED. But the Cluster
Coordinator doesn't think so (because it's already changed the state back to
CONNECTING/CONNECTED).
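
One way to picture the race, and a possible guard against it, is the minimal sketch below. It is not the NiFi implementation: the class, the `update` and `shouldNotify` methods, and the per-node version counter are all hypothetical names introduced for illustration. The idea is that the background task remembers which state change it was scheduled for and skips the notification if the node's state has changed since then.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names come from the NiFi code base.
class DisconnectNotificationSketch {
    enum State { CONNECTING, CONNECTED, DISCONNECTED }

    // A node's state together with a version that increases on every state change.
    static final class NodeStatus {
        final State state;
        final long version;

        NodeStatus(State state, long version) {
            this.state = state;
            this.version = version;
        }
    }

    private final Map<String, NodeStatus> statuses = new ConcurrentHashMap<>();
    private final AtomicLong versions = new AtomicLong();

    // Records a state change and returns the version assigned to it.
    long update(String nodeId, State state) {
        long version = versions.incrementAndGet();
        statuses.put(nodeId, new NodeStatus(state, version));
        return version;
    }

    // The background task would call this before each notification attempt.
    // It returns true only if the node is still DISCONNECTED because of the
    // same state change that scheduled the task.
    boolean shouldNotify(String nodeId, long scheduledVersion) {
        NodeStatus current = statuses.get(nodeId);
        return current != null
                && current.state == State.DISCONNECTED
                && current.version == scheduledVersion;
    }

    public static void main(String[] args) {
        DisconnectNotificationSketch coordinator = new DisconnectNotificationSketch();
        String node = "localhost:5672";

        // User requests a disconnect; the notification task is scheduled.
        long scheduled = coordinator.update(node, State.DISCONNECTED);
        System.out.println("notify before restart: " + coordinator.shouldNotify(node, scheduled));

        // Node restarts and rejoins before the task has run.
        coordinator.update(node, State.CONNECTING);
        System.out.println("notify after restart:  " + coordinator.shouldNotify(node, scheduled));
    }
}
```

A check like this narrows the window but does not close it entirely, since the state could still change between the check and the actual send.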
While the chances that this happens are slim in production and it's easily
worked around (by simply waiting a few seconds after disconnecting a node
before restarting it, or just restarting without disconnecting), it causes a lot
of problems for system tests and potentially other automated activities.
It results in the following log message in the Cluster Coordinator:
{code:java}
2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672]
org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to
notify localhost:5672 that it has been disconnected from the cluster due to
User anonymous requested that node be disconnected from cluster {code}
And then we see confusing error messages such as:
{code:java}
2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23]
org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
Received a status of 200 from localhost:5672 for request PUT
/nifi-api/flow/process-groups/root when performing first stage of two-stage
commit. The action will not occur. Node explanation:
{"id":"root","state":"STOPPED"} {code}
This is because when the cluster coordinator replicates the request to all
nodes, the node that thinks it is disconnected receives the request and
performs the action. It then responds with a "200 OK" but it should have noted
that it's the first phase of a 2-phase action and responded with "201 Continue".
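
The coordinator-side check implied by the log message above can be sketched roughly as follows. This is an illustration only, not the ThreadPoolRequestReplicator code: the class and method names are hypothetical, and the "continue" status is passed in as a parameter because the exact value is an assumption here (the placeholder in `main` simply reuses the number from the description above).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a first-stage check in a two-stage commit.
class FirstStageCheckSketch {

    // Returns true only if every node answered the first stage with the
    // expected "continue" status. Any other status, including a 200 from a
    // node that already performed the action, means the second stage is skipped.
    static boolean allNodesReady(Map<String, Integer> firstStageStatuses, int continueStatus) {
        for (int status : firstStageStatuses.values()) {
            if (status != continueStatus) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Placeholder value taken from the description above; treat it as an assumption.
        int continueStatus = 201;

        Map<String, Integer> statuses = new LinkedHashMap<>();
        statuses.put("localhost:5671", continueStatus);
        // The node that believes it is disconnected performs the action and answers 200.
        statuses.put("localhost:5672", 200);

        if (!allNodesReady(statuses, continueStatus)) {
            System.out.println("First stage failed; the action will not occur.");
        }
    }
}
```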