[ 
https://issues.apache.org/jira/browse/NIFI-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088589#comment-18088589
 ] 

ASF subversion and git services commented on NIFI-16006:
--------------------------------------------------------

Commit 42c3006b12304d309ee573c3069c9f1ff4806920 in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=42c3006b123 ]

NIFI-16006: Addressed issue where a node can be disconnected, then so… (#11331)

* NIFI-16006: Addressed issue where a node can be disconnected, then soon after 
re-connect but then be quickly told to disconnect due to a queued up 
'Disconnect' message from the original disconnection. Now, we use a 
'generation' flag so we know to ignore the message, and we also cancel the 
background task that is trying to deliver it.
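
The 'generation' idea in the commit message above can be sketched roughly as
follows. This is a minimal illustration, not the code from the commit: every
class, method, and field name here (GenerationSketch, Coordinator, Node,
DisconnectMessage, onConnectionRequest, requestDisconnect) is hypothetical, and
it assumes the coordinator hands the node its current generation when the
connection is accepted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only; none of these names come from the NiFi code base.
class GenerationSketch {

    /** A disconnect message stamped with the connection generation it was issued for. */
    static final class DisconnectMessage {
        final String nodeId;
        final long generation;

        DisconnectMessage(String nodeId, long generation) {
            this.nodeId = nodeId;
            this.generation = generation;
        }
    }

    /** Coordinator side: one generation counter and at most one pending retry task per node. */
    static final class Coordinator {
        private final Map<String, AtomicLong> generations = new ConcurrentHashMap<>();
        private final Map<String, Future<?>> pendingDisconnects = new ConcurrentHashMap<>();

        /** A fresh connection request cancels any queued disconnect and starts a new generation. */
        long onConnectionRequest(String nodeId) {
            Future<?> stale = pendingDisconnects.remove(nodeId);
            if (stale != null) {
                stale.cancel(true); // stop the background delivery retries
            }
            return generations.computeIfAbsent(nodeId, id -> new AtomicLong()).incrementAndGet();
        }

        /** Registers the retry task and stamps the message with the node's current generation. */
        DisconnectMessage requestDisconnect(String nodeId, Future<?> retryTask) {
            Future<?> previous = pendingDisconnects.put(nodeId, retryTask);
            if (previous != null) {
                previous.cancel(true);
            }
            AtomicLong generation = generations.computeIfAbsent(nodeId, id -> new AtomicLong());
            return new DisconnectMessage(nodeId, generation.get());
        }
    }

    /** Node side: remembers the generation of its current connection. */
    static final class Node {
        private volatile long currentGeneration;
        private volatile boolean connected;

        void onConnected(long generation) {
            currentGeneration = generation;
            connected = true;
        }

        /** Returns true if the disconnect was applied, false if it was stale and ignored. */
        boolean onDisconnect(DisconnectMessage message) {
            if (message.generation < currentGeneration) {
                return false; // issued for an earlier connection; ignore it
            }
            connected = false;
            return true;
        }

        boolean isConnected() {
            return connected;
        }
    }
}
```

In this sketch the two safeguards are independent: cancelling the retry task
keeps the stale message from being sent at all, and the generation comparison
lets the node discard it if it is delivered anyway.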

> Cluster coordinator can disconnect a freshly-rejoined node
> ----------------------------------------------------------
>
>                 Key: NIFI-16006
>                 URL: https://issues.apache.org/jira/browse/NIFI-16006
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When a NiFi cluster node is disconnected through the REST API (PUT 
> /controller/cluster/nodes/{id} with status DISCONNECTING) and a heartbeat 
> from that node arrives at the coordinator immediately afterward — because the 
> heartbeat was created and dispatched on the node before the node received the 
> disconnection notification — AbstractHeartbeatMonitor logs:
> "Ignoring received heartbeat from disconnected node <host:port>. Node was 
> disconnected due to [User Disconnected Node]. Issuing disconnection request."
> and enqueues a new DISCONNECTION_REQUEST directed at that node. 
> NodeClusterCoordinator's Disconnect <nodeId> thread retries the delivery 
> indefinitely until the notification is successfully received, logging "Failed 
> to notify <host:port> that it has been disconnected" on each failed attempt.
> If the node's JVM is stopped before the retry succeeds and then restarted, 
> the new JVM sends a Cluster Connection Request to the coordinator, is 
> accepted as CONNECTING, and within milliseconds the queued 
> DISCONNECTION_REQUEST (originally intended for the old JVM) is finally 
> delivered to it. StandardFlowService on the new JVM processes the 
> disconnection-notification and flips the node to "Not Clustered", but not 
> before the node's first heartbeat reaches the coordinator. The coordinator 
> therefore observes the node as CONNECTED for one heartbeat cycle and only 
> catches up to reality when it times out the missing heartbeats (~17 seconds 
> with the default 2-second heartbeat interval and 8x missing-heartbeat 
> threshold).
> Net effect: any observer that reads cluster state during the brief 
> false-CONNECTED window — including REST clients, the UI, and system-test 
> helpers such as NiFiSystemIT.waitForAllNodesConnected() — sees a healthy 
> cluster and proceeds. The cluster silently degrades several seconds later 
> with the node DISCONNECTED for "Lack of Heartbeat" and no auto-reconnect.
> h3. Root cause:
> Two cooperating issues; fixing either one alone would prevent the bug:
> # NodeClusterCoordinator does not cancel pending DISCONNECTION_REQUEST retry 
> attempts when it receives a fresh Connection Request from the same node. The 
> retry succeeds against the new JVM and is interpreted as a legitimate 
> user-issued disconnect.
> # StandardFlowService does not track a connection-generation identifier that 
> would let it discard a DISCONNECTION_REQUEST that is older than its current 
> Connection Request. As a result, the freshly-joined node blindly processes 
> the stale message and disconnects itself.
> h3. Reproduction steps:
> # Start a 2-node NiFi cluster.
> # Confirm both nodes are CONNECTED.
> # PUT /nifi-api/controller/cluster/nodes/{node2Id} with status DISCONNECTING.
> # Immediately kill the node 2 JVM (do not wait for the disconnect retries on 
> the coordinator to settle).
> # Restart the node 2 JVM.
> # Poll GET /nifi-api/controller/cluster/nodes from the coordinator: node 2 
> reports CONNECTED briefly, then transitions to DISCONNECTED with disconnect 
> reason Lack of Heartbeat ~16-17 seconds later. Nothing reconnects it.
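
Because a single CONNECTED reading can fall inside the false-CONNECTED window
described above, an observer could instead require the status to stay CONNECTED
for longer than the missing-heartbeat timeout. The following is a minimal
sketch under stated assumptions: StableConnectionCheck and its methods are
hypothetical names, the status supplier stands in for whatever call reads the
node status from the coordinator, and the timeout is taken as simply interval
times threshold (2 s x 8 = 16 s with the values quoted in the description).

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper; not part of NiFi or its system-test framework.
class StableConnectionCheck {

    /** Heartbeat interval multiplied by the number of heartbeats that may be missed. */
    static Duration missingHeartbeatTimeout(Duration heartbeatInterval, int missedThreshold) {
        return heartbeatInterval.multipliedBy(missedThreshold);
    }

    /**
     * Polls the supplied status until the window has elapsed.
     * Returns true only if every reading, including a final one, was "CONNECTED".
     */
    static boolean staysConnected(Supplier<String> status, Duration window, Duration pollInterval) {
        long deadline = System.nanoTime() + window.toNanos();
        do {
            if (!"CONNECTED".equals(status.get())) {
                return false; // the node dropped out during the observation window
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        } while (System.nanoTime() < deadline);
        return "CONNECTED".equals(status.get());
    }
}
```

A caller would pick a window somewhat longer than missingHeartbeatTimeout(...)
so that a node which is about to time out is not reported as healthy.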



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
