[jira] [Comment Edited] (IGNITE-11348) Ping node procedure may fail when another node leaves the cluster

Sergey Chugunov (JIRA) Wed, 20 Feb 2019 00:06:57 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772734#comment-16772734
 ]


Sergey Chugunov edited comment on IGNITE-11348 at 2/20/19 8:05 AM:
-------------------------------------------------------------------

[~dpavlov], the whole sequence of events leading to the issue is as 
follows:
# _leaving node_ sitting on a *host0:port0* disco address leaves the cluster 
(address becomes free);
# _new node_ binds to the same *host0:port0* address and sends join request;
# _old node_ receives join request and starts pinging _new node_;
# NODE_LEFT event for _leaving node_ arrives at _old node_; as part of handling 
NODE_LEFT, the socket for the ongoing ping is retrieved from *pingMap* by address and 
closed (incorrectly, as this ping has nothing to do with _leaving node_).

To avoid this situation I add nodeID to the ping future and check it before closing 
the socket on NODE_LEFT. The ID makes it possible to identify a ping request to _new node_ 
even though _new node_ and _leaving node_ have the same disco address.
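
The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Ignite code: the class PingMapSketch, the PingFuture type, and the methods startPing and onNodeLeft are hypothetical names, and the socket is modelled as a simple flag. Only the idea comes from the comment: the ping future carries the ID of the node being pinged, and the NODE_LEFT handler compares IDs before closing anything.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of an address-keyed ping map with a node ID check. */
final class PingMapSketch {
    /** Stand-in for a ping future; remembers which node the ping targets. */
    static final class PingFuture {
        final UUID targetNodeId;

        /** Models the ping socket state (assumption: a flag instead of a real socket). */
        private volatile boolean socketClosed;

        PingFuture(UUID targetNodeId) {
            this.targetNodeId = targetNodeId;
        }

        void closeSocket() {
            socketClosed = true;
        }

        boolean isSocketClosed() {
            return socketClosed;
        }
    }

    /** Ongoing pings keyed by discovery address, as in the scenario above. */
    private final Map<InetSocketAddress, PingFuture> pingMap = new ConcurrentHashMap<>();

    /** Registers an ongoing ping to the node with the given ID at the given address. */
    PingFuture startPing(InetSocketAddress addr, UUID nodeId) {
        PingFuture fut = new PingFuture(nodeId);
        pingMap.put(addr, fut);
        return fut;
    }

    /** Handles NODE_LEFT: closes the ping socket only if the ping targets the node that left. */
    void onNodeLeft(InetSocketAddress addr, UUID leftNodeId) {
        PingFuture fut = pingMap.get(addr);

        // Without the ID comparison, a ping to a new node that reuses the
        // address of the node that left would be closed by mistake.
        if (fut != null && leftNodeId.equals(fut.targetNodeId))
            fut.closeSocket();
    }
}
```

In this sketch, a NODE_LEFT for _leaving node_ leaves the ping to _new node_ untouched even though both share *host0:port0*, because the IDs differ.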


was (Author: sergey-chugunov):
[~dpavlov], the whole sequence of events leading to the issue is as 
follows:
# _leaving node_ sitting on a *host0:port0* disco address leaves the cluster 
(address becomes free);
# _new node_ binds to the same *host0:port0* address and sends join request;
# _old node_ receives join request and starts pinging _new node_;
# NODE_LEFT event for _leaving node_ arrives at _old node_; as part of handling 
NODE_LEFT, the socket for the ongoing ping is closed (incorrectly, as this ping has 
nothing to do with _leaving node_).

To avoid this situation I add nodeID to the ping future and check it before closing 
the socket on NODE_LEFT. The ID makes it possible to identify a ping request to _new node_ 
even though _new node_ and _leaving node_ have the same disco address.

> Ping node procedure may fail when another node leaves the cluster
> -----------------------------------------------------------------
>
>                 Key: IGNITE-11348
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11348
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Chugunov
>            Assignee: Sergey Chugunov
>            Priority: Critical
>             Fix For: 2.8
>
>
> The additional pinging of a node on join, implemented in IGNITE-5569, may incorrectly 
> fail, which leads to the joining node being shut down.
> The reason is that if another node on the same host, bound to the 
> same discovery port as the joining node, has left the cluster right before the 
> joining node arrives, the socket used for pinging gets closed.
> As a result, the pinging node considers the joining node 
> "unreachable" and fails it with the JOIN_IMPOSSIBLE error code.
> Workaround: simply restart the node that failed on join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
