[
https://issues.apache.org/jira/browse/NIFI-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard resolved NIFI-16006.
-----------------------------------
Fix Version/s: 2.10.0
Resolution: Fixed
> Cluster coordinator can disconnect a freshly-rejoined node
> ----------------------------------------------------------
>
> Key: NIFI-16006
> URL: https://issues.apache.org/jira/browse/NIFI-16006
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 2.10.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When a NiFi cluster node is disconnected through the REST API (PUT
> /controller/cluster/nodes/{id} with status DISCONNECTING), a heartbeat from
> that node can arrive at the coordinator immediately afterward, because the
> heartbeat was created and dispatched on the node before the node received
> the disconnection notification. In that case AbstractHeartbeatMonitor logs:
>   Ignoring received heartbeat from disconnected node <host:port>. Node was
>   disconnected due to [User Disconnected Node]. Issuing disconnection request.
> and enqueues a new DISCONNECTION_REQUEST directed at that node.
> NodeClusterCoordinator's Disconnect <nodeId> thread retries the delivery
> indefinitely until the notification is successfully received, logging Failed
> to notify <host:port> that it has been disconnected on each failed attempt.
> If the node's JVM is stopped before the retry succeeds and then restarted,
> the new JVM sends a Cluster Connection Request to the coordinator, is
> accepted as CONNECTING, and within milliseconds the queued
> DISCONNECTION_REQUEST (originally intended for the old JVM) is finally
> delivered to it. StandardFlowService on the new JVM processes the
> disconnection-notification and flips the node to "Not Clustered", but not
> before the node's first heartbeat reaches the coordinator. The coordinator
> therefore observes the node as CONNECTED for one heartbeat cycle and only
> catches up to reality when it times out the missing heartbeats (roughly
> 16-17 seconds with the default 2-second heartbeat interval and 8x
> missing-heartbeat threshold).
> Net effect: any observer that reads cluster state during the brief
> false-CONNECTED window — including REST clients, the UI, and system-test
> helpers such as NiFiSystemIT.waitForAllNodesConnected() — sees a healthy
> cluster and proceeds. The cluster silently degrades several seconds later
> with the node DISCONNECTED for "Lack of Heartbeat" and no auto-reconnect.
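>
> The false-CONNECTED window suggests that a single CONNECTED reading is not
> enough for a wait helper. The following is a minimal Java sketch of a
> stricter check, assuming polled status samples are available: it treats a
> node as healthy only once it has stayed CONNECTED for at least the heartbeat
> timeout. StableConnectionCheck, Sample, and both method names are
> hypothetical and are not part of NiFi or NiFiSystemIT.
>

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical helper for illustration only; none of these names exist in NiFi. */
final class StableConnectionCheck {

    /** One poll of a node's status as reported by the coordinator. */
    record Sample(long timestampMillis, String status) {}

    /** Time the coordinator needs to notice a silent node: interval x missable heartbeats. */
    static Duration detectionWindow(Duration heartbeatInterval, int missableHeartbeats) {
        return heartbeatInterval.multipliedBy(missableHeartbeats);
    }

    /**
     * Returns true only if the trailing run of samples is all CONNECTED and
     * spans at least requiredWindow. Samples must be ordered oldest first.
     */
    static boolean isStablyConnected(List<Sample> samples, Duration requiredWindow) {
        if (samples.isEmpty()) {
            return false;
        }
        long newest = samples.get(samples.size() - 1).timestampMillis();
        long connectedSince = -1;
        // Walk backwards to find where the trailing CONNECTED run started.
        for (int i = samples.size() - 1; i >= 0; i--) {
            Sample s = samples.get(i);
            if (!"CONNECTED".equals(s.status())) {
                break;
            }
            connectedSince = s.timestampMillis();
        }
        if (connectedSince < 0) {
            return false; // the newest sample was not CONNECTED
        }
        return newest - connectedSince >= requiredWindow.toMillis();
    }
}
```

>
> With a 2-second interval and a threshold of 8, detectionWindow returns 16
> seconds; a caller would probably add a small margin on top of that before
> trusting the result.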
> h3. Root cause:
> Two cooperating issues; fixing either one alone would prevent the bug:
> 1. NodeClusterCoordinator does not cancel pending DISCONNECTION_REQUEST
>    retry attempts when it receives a fresh Connection Request from the same
>    node. The retry succeeds against the new JVM and is interpreted as a
>    legitimate user-issued disconnect.
> 2. StandardFlowService does not track a connection-generation identifier
>    that would let it discard a DISCONNECTION_REQUEST that is older than its
>    current Connection Request. As a result, the freshly-joined node blindly
>    processes the stale message and disconnects itself.
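>
> The two gaps described under Root cause could be closed along the following
> lines. This is a sketch under stated assumptions, not the change that
> shipped in 2.10.0: PendingDisconnects and ConnectionGeneration are
> hypothetical classes, and using a random UUID as the generation identifier
> is one possible choice among several (a timestamp or counter carried in the
> protocol messages would be another).
>

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Coordinator side (hypothetical): lets a new connection request cancel stale disconnect retries. */
final class PendingDisconnects {
    private final Map<String, AtomicBoolean> cancelFlags = new ConcurrentHashMap<>();

    /**
     * Registers a retry loop for nodeId. The loop is expected to check the
     * returned flag before each delivery attempt and stop once it is true.
     */
    AtomicBoolean register(String nodeId) {
        AtomicBoolean flag = new AtomicBoolean(false);
        AtomicBoolean previous = cancelFlags.put(nodeId, flag);
        if (previous != null) {
            previous.set(true); // an older retry loop for the same node is superseded
        }
        return flag;
    }

    /** Called when a fresh connection request arrives from nodeId. */
    void onConnectionRequest(String nodeId) {
        AtomicBoolean flag = cancelFlags.remove(nodeId);
        if (flag != null) {
            flag.set(true); // tell the retry loop to give up
        }
    }
}

/** Node side (hypothetical): identifies the current connection attempt. */
final class ConnectionGeneration {
    private volatile UUID current = UUID.randomUUID();

    /** Starts a new generation; the value would be sent with the connection request. */
    UUID renew() {
        current = UUID.randomUUID();
        return current;
    }

    /** A disconnect request is honored only if it carries the current generation. */
    boolean isCurrent(UUID requestGeneration) {
        return current.equals(requestGeneration);
    }
}
```

>
> In this sketch the coordinator would stamp each DISCONNECTION_REQUEST with
> the generation it received from the node, so a request addressed to the old
> JVM fails the isCurrent check on the new one.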
> h3. Reproduction steps:
> 1. Start a 2-node NiFi cluster.
> 2. Confirm both nodes are CONNECTED.
> 3. PUT /nifi-api/controller/cluster/nodes/{node2Id} with status
>    DISCONNECTING.
> 4. Immediately kill the node 2 JVM (do not wait for the disconnect retries
>    on the coordinator to settle).
> 5. Restart the node 2 JVM.
> 6. Poll GET /nifi-api/controller/cluster/nodes from the coordinator: node 2
>    reports CONNECTED briefly, then transitions to DISCONNECTED with
>    disconnect reason Lack of Heartbeat ~16-17 seconds later. Nothing
>    reconnects it.
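>
> The two REST calls in the reproduction can be built with the JDK's
> java.net.http API, roughly as sketched below. ReproRequests is a
> hypothetical helper, the base URL is whatever the coordinator listens on,
> and the JSON body shape (a "node" object carrying nodeId and status) is an
> assumption that should be checked against the NiFi REST API documentation.
> Authentication and TLS setup are omitted.
>

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for the reproduction steps; not part of NiFi. */
final class ReproRequests {

    /** Builds the PUT that asks the coordinator to disconnect the given node. */
    static HttpRequest disconnect(String baseUrl, String nodeId) {
        // Assumed entity shape: {"node": {"nodeId": ..., "status": "DISCONNECTING"}}
        String body = "{\"node\":{\"nodeId\":\"" + nodeId + "\",\"status\":\"DISCONNECTING\"}}";
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/nifi-api/controller/cluster/nodes/" + nodeId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Builds the GET used to poll node statuses from the coordinator. */
    static HttpRequest listNodes(String baseUrl) {
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/nifi-api/controller/cluster/nodes"))
                .GET()
                .build();
    }
}
```

>
> Each request can then be sent with HttpClient.newHttpClient().send(request,
> HttpResponse.BodyHandlers.ofString()), polling listNodes about once per
> second to catch the brief CONNECTED reading.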
--
This message was sent by Atlassian Jira
(v8.20.10#820010)