[
https://issues.apache.org/jira/browse/NIFI-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Payne updated NIFI-8204:
-----------------------------
Status: Patch Available (was: Open)
> When Cluster Coordinator dies suddenly, it is possible for Component Revisions
> to be inconsistent across nodes in cluster
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-8204
> URL: https://issues.apache.org/jira/browse/NIFI-8204
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Critical
> Fix For: 1.13.0
>
>
> I encountered a scenario in a 2-node cluster where Node 0 was the Cluster
> Coordinator. It suddenly died and was restarted by the RunNiFi process. The
> restart completed before the ZooKeeper session timed out. Once the node had
> rejoined the cluster, I started to see errors when attempting to modify a
> component: "Node xyz is unable to fulfill this request due to
> [0, null, <uuid>] is not the most up-to-date revision. This component appears
> to have been modified."
> Refreshing the browser did not help. This indicates that the nodes in the
> cluster have different component revisions.
> After looking through logs, here is the series of events that led to this
> situation:
>
> 1. Node 0 restarts but is still Cluster Coordinator. Has topology showing all
>    nodes disconnected, all revisions empty.
> 2. Node 1 heartbeats to Node 0. Node 0 responds saying: Your cluster topology
>    is wrong. node-1 should be DISCONNECTED due to Has Not Yet Connected.
> 3. Node 1 updates topology as directed.
> 4. Node 1 becomes cluster coordinator because Node 0 hasn't yet connected and
>    its ZooKeeper session times out.
> 5. Node 1 receives heartbeat from itself.
> 6. Node 1 determines that it hasn't yet connected (based on topology received
>    from Node 0) so issues reconnection request.
> 7. Node 1 changes state of Node 1 from DISCONNECTED to CONNECTING. Notifies
>    Node 0 of the topology update.
> 8. Node 1 relinquishes role as cluster coordinator.
> 9. Node 1 requests (to itself) to join cluster.
> 10. Node 1 receives ConnectionResponse (from itself) that includes a
>     collection of 79 revisions.
> 11. Node 0 finishes startup. Has set of empty revisions.
> 12. Node 0 becomes cluster coordinator.
> 13. Node 1 sends heartbeat to Node 0.
> 14. Node 0 marks Node 1 as Connected to Cluster.
>
> We should address this by keeping track of the number of updates to the
> Revision Manager and sending this in Heartbeat messages. When the Cluster
> Coordinator receives a heartbeat, it should compare the update count to its
> own internal update count. If the heartbeat's update count is higher, it
> should request that the sending node reconnect to the cluster. This will
> ensure that if this situation were to arise again, the node would reconnect
> and get the most up-to-date set of revisions.
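
The proposed check could be sketched roughly as follows. This is an illustration only, not the actual NiFi code or the attached patch: the names CountingRevisionManager, Heartbeat, Action, and evaluate are hypothetical, and the sketch does not model how a node's count is brought back in line after it reconnects, since the description above does not specify that.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RevisionUpdateCountSketch {

    /** Hypothetical stand-in for a Revision Manager that counts its own updates. */
    static final class CountingRevisionManager {
        private final AtomicLong updateCount = new AtomicLong();

        /** Called whenever a component revision is changed on this node. */
        void recordRevisionUpdate() {
            updateCount.incrementAndGet();
        }

        long getUpdateCount() {
            return updateCount.get();
        }
    }

    /** Hypothetical heartbeat payload carrying the sender's update count. */
    static final class Heartbeat {
        final String nodeId;
        final long revisionUpdateCount;

        Heartbeat(String nodeId, long revisionUpdateCount) {
            this.nodeId = nodeId;
            this.revisionUpdateCount = revisionUpdateCount;
        }
    }

    enum Action { ACCEPT, REQUEST_RECONNECT }

    /**
     * Coordinator-side check: if the sender has seen more revision updates
     * than the coordinator has, ask the sender to reconnect so that it picks
     * up the coordinator's set of revisions.
     */
    static Action evaluate(Heartbeat heartbeat, long coordinatorUpdateCount) {
        return heartbeat.revisionUpdateCount > coordinatorUpdateCount
                ? Action.REQUEST_RECONNECT
                : Action.ACCEPT;
    }

    public static void main(String[] args) {
        // Node 0: freshly restarted coordinator with an empty revision set.
        CountingRevisionManager node0 = new CountingRevisionManager();

        // Node 1: has applied some revision updates (three, for illustration).
        CountingRevisionManager node1 = new CountingRevisionManager();
        for (int i = 0; i < 3; i++) {
            node1.recordRevisionUpdate();
        }

        Heartbeat fromNode1 = new Heartbeat("node-1", node1.getUpdateCount());
        System.out.println(evaluate(fromNode1, node0.getUpdateCount()));
    }
}
```

In the scenario described above, Node 1's heartbeat in the final steps would carry a higher count than the restarted Node 0 holds, so Node 0 would request a reconnect instead of simply marking Node 1 as connected.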