Vladimir Steshin created IGNITE-23118:
-----------------------------------------
Summary: Insufficient backward connection check.
Key: IGNITE-23118
URL: https://issues.apache.org/jira/browse/IGNITE-23118
Project: Ignite
Issue Type: Bug
Reporter: Vladimir Steshin
The backward check of a node's status is done only by opening a socket:
{code:java}
// ServerImpl#SocketReader#checkConnection(TcpDiscoveryNode node, int timeout), excerpt:
InetSocketAddress addr = addrs.get(addrIdx.getAndIncrement());

try (Socket sock = new Socket()) {
    if (liveAddrHolder.get() == null) {
        // Only the TCP connect is attempted; nothing is written or read.
        sock.connect(addr, perAddrTimeout);

        liveAddrHolder.compareAndSet(null, addr);
    }
}
{code}
We write no bytes and wait for no response, not even a trivial one. If the JVM
is stuck in a GC pause but the socket connection is still accepted, this check
gives a false positive. As a result, the wrong node can leave the cluster: the
node preceding the hanging one.
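The false positive can be reproduced outside Ignite. The sketch below is a standalone illustration, not Ignite code: the class and method names are hypothetical, and a listening socket whose owner never calls accept() stands in for a JVM with suspended discovery threads. It assumes the usual OS behavior where the kernel completes the TCP handshake from the listen backlog without the application's involvement.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo class, not part of Ignite. */
class ConnectOnlyCheckDemo {
    /** Mirrors the connect-only check: true if the TCP connect succeeds. */
    static boolean connectOnly(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return true;
        }
        catch (IOException e) {
            return false;
        }
    }

    /** True if the connect succeeds but no byte arrives within the timeout. */
    static boolean readTimesOut(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs);
            sock.getInputStream().read();

            return false;
        }
        catch (SocketTimeoutException e) {
            return true;
        }
        catch (IOException e) {
            return false;
        }
    }

    /** Runs both probes against a listener that never accepts: {connectOnly, readTimesOut}. */
    static boolean[] probeStalledListener(int timeoutMs) {
        try (ServerSocket stalled = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress addr =
                new InetSocketAddress(stalled.getInetAddress(), stalled.getLocalPort());

            return new boolean[] {connectOnly(addr, timeoutMs), readTimesOut(addr, timeoutMs)};
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] res = probeStalledListener(500);

        System.out.println("connect-only check passed: " + res[0]);
        System.out.println("read timed out:            " + res[1]);
    }
}
```

Both probes reach the same listener; only the one that waits for data notices that nobody is serving the connection.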
Consider:
1) There is a cluster with nodes 'A', 'B', 'C'.
2) 'B' stalls in a GC pause or waits for some threads to stop at safepoints.
Its discovery threads are already suspended and neither read nor write
messages/responses.
3) 'A' fails to send a message to 'B' and hits the timeout.
4) 'A' connects to 'C' and asks it to check 'B' and, if 'C' cannot check/ping
'B', to establish a new permanent cluster connection 'A'->'C'.
5) 'C' pings 'B' and successfully creates a connection to it (Socket#connect()),
then closes the socket right after opening it.
6) 'C' refuses to establish a permanent cluster connection with 'A' and answers
that 'B' is alive.
7) 'A' tries to connect to 'B' again. It connects successfully
(Socket#connect()) but receives no answer, because the JVM of 'B' can still
accept connections while the Ignite threads that read from and write to the
socket are suspended.
8) 'A' loops through #3 - #7 until it reaches `connectionRecoveryTimeout`.
9) 'A' gets segmented and leaves the cluster even though it is alive and able to
establish a permanent cluster connection to 'C'.
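The loop in steps 3 - 9 can be reduced to a toy model. This is a deliberately simplified sketch and not Ignite's recovery code: the class, the enum, the virtual clock and the numeric values are all assumptions made for illustration. It only shows why 'A' ends up segmented as long as 'C' keeps reporting 'B' as alive.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical model of the scenario; not Ignite's recovery logic. */
class RecoveryLoopModel {
    enum Outcome { CONNECTED_TO_C, SEGMENTED }

    /**
     * @param cReportsBAlive Result of C's backward check of B on each attempt.
     * @param recoveryTimeoutMs Stand-in for connectionRecoveryTimeout.
     * @param attemptMs Assumed cost of one failed attempt to reach B.
     */
    static Outcome recover(BooleanSupplier cReportsBAlive, long recoveryTimeoutMs, long attemptMs) {
        long elapsed = 0; // Virtual clock, no real sleeping.

        while (elapsed < recoveryTimeoutMs) {
            // Steps 3 and 7: A's message to B times out, B's threads are suspended.
            elapsed += attemptMs;

            // Steps 4 - 6: A asks C; C refuses A->C while it believes B is alive.
            if (!cReportsBAlive.getAsBoolean())
                return Outcome.CONNECTED_TO_C;
        }

        // Step 9: the timeout is exhausted and A leaves the cluster.
        return Outcome.SEGMENTED;
    }

    public static void main(String[] args) {
        // Connect-only backward check: always "alive" while B is stalled.
        System.out.println(recover(() -> true, 10_000, 2_000));

        // A check that notices the stall lets A reconnect to C instead.
        System.out.println(recover(() -> false, 10_000, 2_000));
    }
}
```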
We should either make this check write something to the socket and wait
for a response, or remove the check altogether.
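One possible shape of the first option is sketched below. It is only an illustration under stated assumptions: the PING/PONG byte values, the class name and the helper methods are hypothetical and do not correspond to Ignite's discovery wire protocol. The check counts a node as alive only if it answers within the timeout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical sketch of a write-and-wait check; not Ignite's protocol. */
class PingPongCheckSketch {
    static final int PING = 1; // Assumed request byte.
    static final int PONG = 2; // Assumed response byte.

    /** True only if the peer answers PING with PONG within the timeout. */
    static boolean checkConnection(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs); // Bounds the blocking read below.

            sock.getOutputStream().write(PING);
            sock.getOutputStream().flush();

            return sock.getInputStream().read() == PONG;
        }
        catch (IOException e) {
            return false; // Connect failure, read timeout, or reset.
        }
    }

    /** Opens a loopback listener; a non-responsive one never calls accept(). */
    static ServerSocket startListener(boolean responsive) {
        try {
            ServerSocket srv = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());

            if (responsive) {
                Thread t = new Thread(() -> {
                    while (!srv.isClosed()) {
                        try (Socket s = srv.accept()) {
                            if (s.getInputStream().read() == PING) {
                                s.getOutputStream().write(PONG);
                                s.getOutputStream().flush();
                            }
                        }
                        catch (IOException e) {
                            // Listener closed or peer dropped; the loop condition decides.
                        }
                    }
                });

                t.setDaemon(true);
                t.start();
            }

            return srv;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static InetSocketAddress addrOf(ServerSocket srv) {
        return new InetSocketAddress(srv.getInetAddress(), srv.getLocalPort());
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket live = startListener(true); ServerSocket stalled = startListener(false)) {
            System.out.println("responsive node alive: " + checkConnection(addrOf(live), 500));
            System.out.println("stalled node alive:    " + checkConnection(addrOf(stalled), 500));
        }
    }
}
```

The read timeout here bounds a single read, not the whole check; a real fix would also have to fit the request and response into the existing per-address timeout budget.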
--
This message was sent by Atlassian Jira
(v8.20.10#820010)