[ 
https://issues.apache.org/jira/browse/NIFI-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Bende updated NIFI-8204:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> When Cluster Coordinator dies suddenly, it is possible for Component Revisions 
> to be inconsistent across nodes in cluster
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-8204
>                 URL: https://issues.apache.org/jira/browse/NIFI-8204
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> I encountered a scenario in a 2-node cluster where Node 0 was the Cluster 
> Coordinator. It died suddenly and was restarted by the RunNiFi process. The 
> restart occurred more quickly than the ZooKeeper session timeout. Once the 
> node had rejoined the cluster, I started to see errors when attempting to 
> modify a component: "Node xyz is unable to fulfill this request due to 
> [0, null, <uuid>] is not the most up-to-date revision. This component appears 
> to have been modified."
> Refreshing the browser did not help, which indicates that the nodes in the 
> cluster hold different component revisions.
> After looking through logs, here is the series of events that led to this 
> situation:
>  
>  1. Node 0 restarts but is still Cluster Coordinator. Has topology showing 
>     all nodes disconnected, all revisions empty.
>  2. Node 1 heartbeats to Node 0. Node 0 responds saying: Your cluster 
>     topology is wrong. node-1 should be DISCONNECTED due to Has Not Yet 
>     Connected.
>  3. Node 1 updates topology as directed.
>  4. Node 1 becomes cluster coordinator because Node 0 hasn't yet connected 
>     and its ZooKeeper session times out.
>  5. Node 1 receives heartbeat from itself.
>  6. Node 1 determines that it hasn't yet connected (based on topology 
>     received from Node 0) so issues reconnection request.
>  7. Node 1 changes state of Node 1 from DISCONNECTED to CONNECTING. Notifies 
>     Node 0 of the topology update.
>  8. Node 1 relinquishes role as cluster coordinator.
>  9. Node 1 requests (to itself) to join cluster.
> 10. Node 1 receives ConnectionResponse (from itself) that includes a 
>     collection of 79 revisions.
> 11. Node 0 finishes startup. Has set of empty revisions.
> 12. Node 0 becomes cluster coordinator.
> 13. Node 1 sends heartbeat to Node 0.
> 14. Node 0 marks Node 1 as Connected to Cluster.
>  
> We should address this by keeping track of the number of updates to the 
> Revision Manager and sending this in Heartbeat messages. When the Cluster 
> Coordinator receives a heartbeat, it should compare the update count to its 
> own internal update count. If the heartbeat's update count is higher, it 
> should request that the sending node reconnect to the cluster. This will 
> ensure that if this situation were to arise again, the node would reconnect 
> and get the most up-to-date set of revisions.
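
The proposed check can be sketched as follows. This is a minimal illustration of the idea only, not the code that was committed for this issue: the class, field, and method names (RevisionTracker, Heartbeat, evaluate, and so on) are hypothetical and are not taken from the NiFi code base, and the count of 79 updates is an assumed value chosen to echo the scenario above.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RevisionUpdateCountCheck {

    /** Hypothetical stand-in for a node's Revision Manager: counts every update it applies. */
    static final class RevisionTracker {
        private final AtomicLong updateCount = new AtomicLong();

        void recordUpdate() {
            updateCount.incrementAndGet();
        }

        long getUpdateCount() {
            return updateCount.get();
        }
    }

    /** Hypothetical heartbeat payload that carries the sender's update count. */
    static final class Heartbeat {
        final String nodeId;
        final long revisionUpdateCount;

        Heartbeat(String nodeId, long revisionUpdateCount) {
            this.nodeId = nodeId;
            this.revisionUpdateCount = revisionUpdateCount;
        }
    }

    enum Action { ACCEPT, REQUEST_RECONNECT }

    /**
     * Coordinator-side check: if the sender has seen more revision updates than
     * the coordinator, ask the sender to reconnect so that both sides end up
     * with the same set of revisions.
     */
    static Action evaluate(Heartbeat heartbeat, RevisionTracker coordinator) {
        return heartbeat.revisionUpdateCount > coordinator.getUpdateCount()
                ? Action.REQUEST_RECONNECT
                : Action.ACCEPT;
    }

    public static void main(String[] args) {
        // Node 0: freshly restarted coordinator with no revision updates.
        RevisionTracker node0 = new RevisionTracker();

        // Node 1: assumed to have applied 79 updates (illustrative value).
        RevisionTracker node1 = new RevisionTracker();
        for (int i = 0; i < 79; i++) {
            node1.recordUpdate();
        }

        Heartbeat heartbeat = new Heartbeat("node-1", node1.getUpdateCount());
        System.out.println(evaluate(heartbeat, node0)); // prints REQUEST_RECONNECT
    }
}
```

In this sketch a heartbeat whose count is equal to or lower than the coordinator's is accepted unchanged; only a strictly higher count triggers a reconnect request, matching the comparison described in the issue.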



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
