[ 
https://issues.apache.org/jira/browse/NIFI-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015927#comment-16015927
 ] 

Mark Payne commented on NIFI-3933:
----------------------------------

From a quick glance at the code, the flaw appears to be as follows:

Assume Node A is the coordinator and suddenly dies.
Assume Node B is then elected the new coordinator.
Node B will now clear any heartbeats that it has received (it may have received 
some really old heartbeats if it has been coordinator before).
Every few seconds, Node B checks for any nodes that haven't heartbeated in "a 
long time." It does this by checking its internal Map<NodeIdentifier, 
NodeHeartbeat>.
Since Node A died, it will never send a heartbeat to Node B. As a result, it 
will never show up in that Map.
This means that when Node B looks for old heartbeats, it won't find an entry 
for Node A at all.
As a result, it never kicks Node A out of the cluster.

This explains why it only happens when the coordinator dies. If Node B had died 
instead, Node A would have known the latest heartbeat timestamp for Node B and 
would properly kick it out of the cluster.
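
To make the failure mode concrete, here is a minimal sketch of the behavior 
described above. It is not NiFi's actual code: the class name, the method 
names, and the use of String node IDs and millisecond timestamps in place of 
Map<NodeIdentifier, NodeHeartbeat> are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the heartbeat monitor; not NiFi's real classes.
class ClearedHeartbeatSketch {
    // Latest heartbeat timestamp (millis) per node ID.
    // Stands in for the Map<NodeIdentifier, NodeHeartbeat> mentioned above.
    private final Map<String, Long> latestHeartbeats = new HashMap<>();

    // Called when this node becomes coordinator: drop any old heartbeats.
    public void onElectedCoordinator() {
        latestHeartbeats.clear();
    }

    // Called whenever a heartbeat arrives from a node.
    public void onHeartbeat(String nodeId, long timestampMillis) {
        latestHeartbeats.put(nodeId, timestampMillis);
    }

    // Periodic check: only nodes that have an entry in the map can be reported.
    public List<String> findStaleNodes(long nowMillis, long thresholdMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : latestHeartbeats.entrySet()) {
            if (nowMillis - entry.getValue() > thresholdMillis) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

If Node B is elected, then receives heartbeats only from itself and Node C, 
findStaleNodes can never return Node A, no matter how much time passes, 
because the loop only visits nodes that already have an entry in the map.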

I think we can remedy this by changing how we clear the heartbeats. 
Instead of clearing that map, we should create an entry for each node, with a 
timestamp of 'now'. This way, if the node never heartbeats, there will still be 
an entry in the map.
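
A rough sketch of that idea, again under assumptions: the class and method 
names are hypothetical, String IDs and millisecond timestamps stand in for 
NodeIdentifier and NodeHeartbeat, and the caller is assumed to be able to 
supply the list of nodes currently known to the cluster. The actual change in 
NiFi may look quite different.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the proposed remedy; not NiFi's real classes.
class SeededHeartbeatSketch {
    // Latest heartbeat timestamp (millis) per node ID.
    private final Map<String, Long> latestHeartbeats = new HashMap<>();

    // On election, discard old heartbeats but seed one entry per known node,
    // stamped with 'now', so that a node that never heartbeats is still tracked.
    public void onElectedCoordinator(Collection<String> knownNodeIds, long nowMillis) {
        latestHeartbeats.clear();
        for (String nodeId : knownNodeIds) {
            latestHeartbeats.put(nodeId, nowMillis);
        }
    }

    // A real heartbeat overwrites the seeded (or previous) timestamp.
    public void onHeartbeat(String nodeId, long timestampMillis) {
        latestHeartbeats.put(nodeId, timestampMillis);
    }

    // Periodic check: report every node whose latest timestamp is too old.
    public List<String> findStaleNodes(long nowMillis, long thresholdMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : latestHeartbeats.entrySet()) {
            if (nowMillis - entry.getValue() > thresholdMillis) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

With the seeded entries, a dead Node A gets a full threshold period from the 
moment of election before it is reported as stale, while nodes that keep 
heartbeating simply overwrite their seeded timestamp.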

> Cluster - Cluster Coordinator removal due to lack of heartbeats
> ---------------------------------------------------------------
>
>                 Key: NIFI-3933
>                 URL: https://issues.apache.org/jira/browse/NIFI-3933
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Matt Gilman
>
> If the process for the acting Cluster Coordinator is killed, the node will 
> remain in the cluster despite not receiving heartbeats past the configured 
> threshold. If this same action is taken against a node that is not the 
> Cluster Coordinator, it will be removed once the configured threshold has 
> elapsed.
> From the mailing list [1]
> [1] 
> https://mail-archives.apache.org/mod_mbox/nifi-users/201705.mbox/%3C1495107006491-1942.post%40n4.nabble.com%3E
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
