[
https://issues.apache.org/jira/browse/IGNITE-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772734#comment-16772734
]
Sergey Chugunov edited comment on IGNITE-11348 at 2/20/19 8:05 AM:
-------------------------------------------------------------------
[~dpavlov], the whole sequence of events leading to the issue is as
follows:
# _leaving node_ sitting on a *host0:port0* disco address leaves the cluster
(address becomes free);
# _new node_ binds to the same *host0:port0* address and sends a join request;
# _old node_ receives the join request and starts pinging _new node_;
# a NODE_LEFT event for _leaving node_ arrives at _old node_; as part of handling
NODE_LEFT, the socket for the ongoing ping is retrieved from *pingMap* by address
and closed (incorrectly, as this ping has nothing to do with _leaving node_).
To avoid this situation, I add nodeID to the ping future and check it before
closing the socket on NODE_LEFT. The ID makes it possible to identify a ping
request to _new node_ even though _new node_ and _leaving node_ have the same
disco address.
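
The check described above can be illustrated with a minimal sketch. This is not the actual Ignite discovery code: the class PingMapSketch, the nested PingFuture, and the methods startPing and onNodeLeft are hypothetical names chosen for illustration, and the socket is modelled as a simple flag. Only the idea comes from the comment: the ping map is keyed by address, each ping future remembers the ID of the node it targets, and NODE_LEFT handling closes the socket only when the IDs match.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; all names here are assumptions, not Ignite's API. */
class PingMapSketch {
    /** Hypothetical stand-in for an in-flight ping and its socket. */
    static final class PingFuture {
        /** ID of the node this ping is addressed to. */
        final UUID targetNodeId;

        /** Models the ping socket state; a real implementation would hold a Socket. */
        private final AtomicBoolean sockClosed = new AtomicBoolean();

        PingFuture(UUID targetNodeId) {
            this.targetNodeId = targetNodeId;
        }

        void closeSocket() {
            sockClosed.set(true);
        }

        boolean isSocketClosed() {
            return sockClosed.get();
        }
    }

    /** Ongoing pings keyed by discovery address, as in the scenario above. */
    private final Map<InetSocketAddress, PingFuture> pingMap = new ConcurrentHashMap<>();

    /** Registers a ping to the node with the given ID, or returns the one already in flight. */
    PingFuture startPing(InetSocketAddress addr, UUID nodeId) {
        PingFuture fut = new PingFuture(nodeId);
        PingFuture prev = pingMap.putIfAbsent(addr, fut);

        return prev != null ? prev : fut;
    }

    /** Handles NODE_LEFT: closes the ping socket only if the ping targets the node that left. */
    void onNodeLeft(InetSocketAddress addr, UUID leftNodeId) {
        PingFuture fut = pingMap.get(addr);

        // The address alone is ambiguous once a new node has reused it,
        // so the node ID stored on the future is compared as well.
        if (fut != null && leftNodeId.equals(fut.targetNodeId))
            fut.closeSocket();
    }
}
```

In this sketch, a ping started for _new node_ on *host0:port0* survives onNodeLeft called with the ID of _leaving node_, and is closed only when onNodeLeft is called with the ID of _new node_ itself.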
was (Author: sergey-chugunov):
[~dpavlov], the whole sequence of events leading to the issue is as
follows:
# _leaving node_ sitting on a *host0:port0* disco address leaves the cluster
(address becomes free);
# _new node_ binds to the same *host0:port0* address and sends a join request;
# _old node_ receives the join request and starts pinging _new node_;
# a NODE_LEFT event for _leaving node_ arrives at _old node_; as part of handling
NODE_LEFT, the socket for the ongoing ping is closed (incorrectly, as this ping
has nothing to do with _leaving node_).
To avoid this situation, I add nodeID to the ping future and check it before
closing the socket on NODE_LEFT. The ID makes it possible to identify a ping
request to _new node_ even though _new node_ and _leaving node_ have the same
disco address.
> Ping node procedure may fail when another node leaves the cluster
> -----------------------------------------------------------------
>
> Key: IGNITE-11348
> URL: https://issues.apache.org/jira/browse/IGNITE-11348
> Project: Ignite
> Issue Type: Bug
> Reporter: Sergey Chugunov
> Assignee: Sergey Chugunov
> Priority: Critical
> Fix For: 2.8
>
>
> The additional pinging of a node on join, implemented in IGNITE-5569, may
> incorrectly fail, leading to the joining node being shut down.
> The reason is that if another node on the same host, bound to the same
> discovery port as the joining node, has left the cluster right before the
> joining node joins, the socket used for pinging gets closed.
> As a result, the pinging node considers the joining node "unreachable" and
> fails it with the JOIN_IMPOSSIBLE error code.
> Workaround: simply restart the node that failed on join.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)