[
https://issues.apache.org/jira/browse/NIFI-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279973#comment-17279973
]
ASF subversion and git services commented on NIFI-8204:
-------------------------------------------------------
Commit 749d05840ba88efc8b42f5434d9223104edfab68 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=749d058 ]
NIFI-8204, NIFI-7866: Send revision update count in heartbeats. If the update
count in a heartbeat is greater than that of the cluster coordinator, request
that the node reconnect to get the most up-to-date revisions. We cannot check
exact equality, as the values may change between the time a heartbeat is
created and the time the cluster coordinator receives it. However, it should
be safe to assume that the revision count won't be greater than that of the
cluster coordinator. There is a tiny window in which it could be: the sending
node may update its revision, create the heartbeat, send it, and have the
cluster coordinator process it before updating its own count. However, this
window is incredibly small and would only result in the sending node
reconnecting, which resolves itself. Also, while testing this fix, encountered
NIFI-7866 and addressed that NullPointerException.
This closes #4806.
Signed-off-by: Bryan Bende <[email protected]>
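The commit's strictly-greater-than check (rather than an exact-equality check)
can be sketched roughly as below; the class and method names are illustrative
assumptions, not NiFi's actual API:

```java
// Sketch of the coordinator-side check described in the commit message.
// Names here are hypothetical; NiFi's real classes and signatures differ.
public class RevisionConsistencyCheck {

    /**
     * Returns true if the heartbeating node should be asked to reconnect.
     * Only a strictly greater update count triggers reconnection, because
     * the coordinator's own count may legitimately lag a just-sent heartbeat
     * (the race window described above), while a node reporting MORE updates
     * than the coordinator has seen must have inconsistent revisions.
     */
    public static boolean shouldRequestReconnect(long heartbeatUpdateCount,
                                                 long coordinatorUpdateCount) {
        return heartbeatUpdateCount > coordinatorUpdateCount;
    }
}
```

Equal counts, or a heartbeat slightly behind the coordinator, are treated as
normal; only the anomalous "ahead of the coordinator" case forces a reconnect.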
> When the Cluster Coordinator dies suddenly, it is possible for Component
> Revisions to be inconsistent across nodes in the cluster
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-8204
> URL: https://issues.apache.org/jira/browse/NIFI-8204
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Critical
> Fix For: 1.13.0
>
>
> I encountered a scenario in a 2-node cluster where Node 0 was the Cluster
> Coordinator. It suddenly died and was restarted by the RunNiFi process. The
> restart occurred more quickly than the zookeeper session timeout. Once the
> node was rejoined to the cluster, I started to see errors when attempting to
> modify a component: "Node xyz is unable to fulfill this request due to
> [0, null, <uuid>] is not the most up-to-date revision. This component appears
> to have been modified."
> Refreshing the browser did not help. This indicates that nodes in the cluster
> have different component revisions.
> After looking through logs, here is the series of events that led to this
> situation:
>
> 1. Node 0 restarts but is still Cluster Coordinator. Its topology shows all
>    nodes disconnected, with all revisions empty.
> 2. Node 1 heartbeats to Node 0. Node 0 responds: your cluster topology is
>    wrong; node-1 should be DISCONNECTED due to Has Not Yet Connected.
> 3. Node 1 updates its topology as directed.
> 4. Node 1 becomes Cluster Coordinator because Node 0 hasn't yet connected
>    and Node 0's ZooKeeper session times out.
> 5. Node 1 receives a heartbeat from itself.
> 6. Node 1 determines that it hasn't yet connected (based on the topology
>    received from Node 0), so it issues a reconnection request.
> 7. Node 1 changes the state of Node 1 from DISCONNECTED to CONNECTING and
>    notifies Node 0 of the topology update.
> 8. Node 1 relinquishes its role as Cluster Coordinator.
> 9. Node 1 requests (to itself) to join the cluster.
> 10. Node 1 receives a ConnectionResponse (from itself) that includes a
>     collection of 79 revisions.
> 11. Node 0 finishes startup with a set of empty revisions.
> 12. Node 0 becomes Cluster Coordinator.
> 13. Node 1 sends a heartbeat to Node 0.
> 14. Node 0 marks Node 1 as connected to the cluster.
>
> We should address this by keeping track of the number of updates to the
> Revision Manager and sending this in Heartbeat messages. When the Cluster
> Coordinator receives a heartbeat, it should compare the update count to its
> own internal update count. If the heartbeat's update count is higher, it
> should request that the sending node reconnect to the cluster. This will
> ensure that if this situation were to arise again, the node would reconnect
> and get the most up-to-date set of revisions.
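The proposal above can be sketched as a small counter kept alongside the
Revision Manager; the class and method names here are hypothetical, not
NiFi's real RevisionManager interface:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the proposed fix: count revision updates so the
// count can be embedded in heartbeat messages and compared by the
// coordinator. Names are assumptions for illustration only.
public class CountingRevisionManager {
    private final AtomicLong updateCount = new AtomicLong(0);

    /** Called whenever a component revision is updated. */
    public void recordRevisionUpdate() {
        updateCount.incrementAndGet();
    }

    /** Value to embed in the node's next heartbeat message. */
    public long getUpdateCount() {
        return updateCount.get();
    }

    /**
     * Coordinator-side check: a node reporting more updates than the
     * coordinator has recorded must have inconsistent revision state, so
     * it should be asked to reconnect and fetch the current revisions.
     */
    public boolean heartbeatRequiresReconnect(long heartbeatUpdateCount) {
        return heartbeatUpdateCount > updateCount.get();
    }
}
```

An AtomicLong keeps the counter safe to bump from concurrent update threads
while heartbeat construction reads it without locking.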
--
This message was sent by Atlassian Jira
(v8.3.4#803005)