[ https://issues.apache.org/jira/browse/NIFI-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580862#comment-17580862 ]
ASF subversion and git services commented on NIFI-10362:
--------------------------------------------------------
Commit 21503f6353c33063b7acff5915a94397aad72926 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=21503f6353 ]
NIFI-10362: When asynchronous node disconnect is issued, do not send disconnect
to node if the node becomes reconnected in the interim. Also, addressed the
issue in which a disconnected node acts on a replicated request during the
first phase by detecting that it's the first phase if configured for cluster,
not only when connected to a cluster.
This closes #6308
Signed-off-by: David Handermann <[email protected]>
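
The first part of the commit message (skip the disconnect notification if the node has reconnected in the interim) can be illustrated with a minimal sketch. This is not the NiFi implementation: `NodeState`, `DisconnectNotifier`, and every method name below are hypothetical, and the coordinator's state is reduced to a simple map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the coordinator's view of node states.
// None of these names are taken from the NiFi code base.
enum NodeState { CONNECTING, CONNECTED, DISCONNECTED }

class DisconnectNotifier {
    private final Map<String, NodeState> states = new ConcurrentHashMap<>();

    void setState(String nodeId, NodeState state) {
        states.put(nodeId, state);
    }

    // A user-requested disconnect marks the node DISCONNECTED immediately and
    // returns the task that a background thread would run later.
    Runnable requestDisconnect(String nodeId, Runnable sendNotification) {
        states.put(nodeId, NodeState.DISCONNECTED);
        return () -> notifyIfStillDisconnected(nodeId, sendNotification);
    }

    // Re-check the state right before notifying. If the node has moved back
    // to CONNECTING or CONNECTED in the interim, the notification is skipped.
    boolean notifyIfStillDisconnected(String nodeId, Runnable sendNotification) {
        if (states.get(nodeId) != NodeState.DISCONNECTED) {
            return false;
        }
        sendNotification.run();
        return true;
    }
}
```

The sketch only narrows the window: a node could still reconnect between the state check and the send, so a real implementation would likely need to coordinate the check with the state transition.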
> Cluster can disconnect node as soon as it rejoins cluster upon restart
> ----------------------------------------------------------------------
>
> Key: NIFI-10362
> URL: https://issues.apache.org/jira/browse/NIFI-10362
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.18.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When the Cluster Coordinator disconnects a node due to a user requesting that
> the node get disconnected, the node is immediately marked as DISCONNECTED,
> and then a background thread is responsible for notifying the node that it's
> been disconnected. The background task attempts several times if it cannot
> successfully send the notification.
> However, if the node is disconnected and then restarted before it's been
> notified, we have a situation in which the node becomes CONNECTING (and
> possibly then CONNECTED), and then the background task is triggered. This
> then results in the node being told that it's DISCONNECTED. But the Cluster
> Coordinator doesn't think so (because it's already changed the state back to
> CONNECTING/CONNECTED).
> While the chances that this happens are slim in production and it's easily
> worked around (by simply waiting a few seconds after disconnecting a node
> before restarting it, or just restarting without disconnecting) it causes a
> lot of problems for system tests and potentially other automated activities.
> It results in the following log message in the Cluster Coordinator:
> {code:java}
> 2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672]
> org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to
> notify localhost:5672 that it has been disconnected from the cluster due to
> User anonymous requested that node be disconnected from cluster {code}
> And then we see confusing error messages such as:
> {code:java}
> 2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23]
> org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
> Received a status of 200 from localhost:5672 for request PUT
> /nifi-api/flow/process-groups/root when performing first stage of two-stage
> commit. The action will not occur. Node explanation:
> {"id":"root","state":"STOPPED"} {code}
> This is because when the cluster coordinator replicates the request to all
> nodes, the node that thinks it is disconnected receives the request and
> performs the action. It then responds with a "200 OK" but it should have
> noted that it's the first phase of a 2-phase action and responded with "201
> Continue".
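
The second part of the fix, as described in the commit message, changes how a node decides that a replicated request is the first phase of the two-phase action. The sketch below is an illustration under assumptions, not NiFi code: `PhaseOutcome`, `ReplicatedRequestHandler`, and the `firstPhaseMarker` flag are hypothetical, and HTTP status codes are deliberately replaced by an enum.

```java
// Hypothetical sketch; names are illustrative and not taken from NiFi.
enum PhaseOutcome { CONTINUE_WITHOUT_ACTING, ACTION_PERFORMED }

class ReplicatedRequestHandler {
    private final boolean configuredForClustering;

    ReplicatedRequestHandler(boolean configuredForClustering) {
        this.configuredForClustering = configuredForClustering;
    }

    // firstPhaseMarker: whether the request carries the coordinator's marker
    // for the first stage of the two-stage commit (an assumed flag).
    // The decision depends on whether the node is configured for clustering,
    // not on whether it currently believes it is connected, so a node that
    // thinks it is disconnected still refrains from acting in the first phase.
    PhaseOutcome handle(boolean firstPhaseMarker, Runnable action) {
        boolean firstPhase = configuredForClustering && firstPhaseMarker;
        if (firstPhase) {
            return PhaseOutcome.CONTINUE_WITHOUT_ACTING;
        }
        action.run();
        return PhaseOutcome.ACTION_PERFORMED;
    }
}
```

With this check, the node in the scenario above would answer the first-stage request without performing the action, instead of applying it and returning a plain success.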