Vladimir Steshin created IGNITE-23118:
-----------------------------------------

             Summary: Insufficient backward connection check.
                 Key: IGNITE-23118
                 URL: https://issues.apache.org/jira/browse/IGNITE-23118
             Project: Ignite
          Issue Type: Bug
            Reporter: Vladimir Steshin


We perform the backward node status check only by opening a socket:
{code:java}
// ServerImpl#SocketReader#checkConnection(TcpDiscoveryNode node, int timeout):

InetSocketAddress addr = addrs.get(addrIdx.getAndIncrement());

try (Socket sock = new Socket()) {
    if (liveAddrHolder.get() == null) {
        sock.connect(addr, perAddrTimeout);

        liveAddrHolder.compareAndSet(null, addr);
    }
}
{code}

We write no bytes and do not wait for even a trivial response. If the JVM is
stuck in a GC pause but still accepts socket connections, this check gives a
false positive result. This can cause the wrong node to leave the cluster: the
node before the hanging one.
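
The false positive can be reproduced without Ignite. The sketch below is only an illustration, not Ignite code: the class `ConnectOnlyCheckDemo` and the method `connectOnly` are hypothetical names, and a listener that is bound but never calls accept() stands in for a node whose threads are suspended. The OS normally completes the TCP handshake and queues the connection in the listen backlog, so a connect-only check still reports success.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ConnectOnlyCheckDemo {
    /** Connect-only check in the spirit of the snippet above: true if connect() succeeds. */
    static boolean connectOnly(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return true;
        }
        catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Bound listener whose owner never calls accept(): a stand-in for a suspended node.
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress addr = new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort());

            // The application never touched the connection, yet the check passes.
            System.out.println("connect-only check passed: " + connectOnly(addr, 500));
        }
    }
}
```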

Consider:
    1) There is a cluster with nodes 'A', 'B', 'C'.
    2) 'B' is delayed by a GC pause or waits for some threads to stop at safe
points. Its discovery threads are already suspended and do not read or write
messages/responses.
    3) 'A' fails to send a message to 'B' and sees the timeout.
    4) 'A' connects to 'C', asks it to check 'B' and to establish a new
permanent cluster connection 'A'->'C' if 'C' cannot check/ping 'B'.
    5) 'C' pings 'B' and successfully creates a connection to it
(Socket#connect()), then closes the socket right after it was opened.
    6) 'C' refuses to establish a permanent cluster connection with 'A' and
answers that 'B' is alive.
    7) 'A' tries to connect to 'B' again. It connects successfully
(Socket#connect()), but receives no answer because the JVM of 'B' can only
accept connections, while the Ignite threads that read from and write to the
socket are suspended.
    8) 'A' repeats steps 3 - 7 until it reaches `connectionRecoveryTimeout`.
    9) 'A' segments and leaves the cluster even though it is alive and able to
establish a permanent cluster connection to 'C'.

We should either make this check write something to the socket and wait for a
response, or remove it altogether.
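
A minimal sketch of the first option, under stated assumptions: the one-byte PING/PONG exchange, the class `HandshakeCheckSketch` and the methods `isResponsive` and `startResponder` are hypothetical and are not Ignite's discovery protocol or API. The point is only that the check should pass when an application thread on the peer actually reads the probe and answers within the timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class HandshakeCheckSketch {
    // Hypothetical probe bytes, chosen for illustration only.
    static final int PING = 1;
    static final int PONG = 2;

    /** Passes only if the peer reads the probe and replies before the timeout. */
    static boolean isResponsive(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs); // Bound the wait for the reply.

            sock.getOutputStream().write(PING);
            sock.getOutputStream().flush();

            return sock.getInputStream().read() == PONG;
        }
        catch (IOException e) {
            // Connect failure, read timeout or reset: treat the peer as not responsive.
            return false;
        }
    }

    /** Starts a daemon thread that answers one PING with PONG; models a live node. */
    static Thread startResponder(ServerSocket srv) {
        Thread t = new Thread(() -> {
            try (Socket peer = srv.accept()) {
                if (peer.getInputStream().read() == PING) {
                    peer.getOutputStream().write(PONG);
                    peer.getOutputStream().flush();
                }
            }
            catch (IOException ignored) {
                // Listener was closed; nothing to answer.
            }
        });

        t.setDaemon(true);
        t.start();

        return t;
    }

    public static void main(String[] args) throws IOException {
        InetAddress lo = InetAddress.getLoopbackAddress();

        // 'live' has a thread serving it; 'hung' is bound but nobody calls accept().
        try (ServerSocket live = new ServerSocket(0, 50, lo); ServerSocket hung = new ServerSocket(0, 50, lo)) {
            startResponder(live);

            System.out.println("live node: " + isResponsive(new InetSocketAddress(lo, live.getLocalPort()), 500));
            System.out.println("hung node: " + isResponsive(new InetSocketAddress(lo, hung.getLocalPort()), 500));
        }
    }
}
```

With such a check, step 5 of the scenario would fail for the suspended 'B', so 'C' could accept the permanent connection from 'A' instead of reporting 'B' as alive.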



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
