[jira] [Updated] (IGNITE-23118) Insufficient backward connection check.

Vladimir Steshin (Jira) Fri, 30 Aug 2024 03:17:07 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vladimir Steshin updated IGNITE-23118:
--------------------------------------
    Description: 
We perform the backward node status check only by opening a socket:
{code:java}
// ServerImpl#SocketReader#checkConnection(TcpDiscoveryNode node, int timeout):

InetSocketAddress addr = addrs.get(addrIdx.getAndIncrement());

try (Socket sock = new Socket()) {
    if (liveAddrHolder.get() == null) {
        // The only liveness signal: the TCP connect succeeds.
        sock.connect(addr, perAddrTimeout);

        // Nothing is written or read before the address is marked live.
        liveAddrHolder.compareAndSet(null, addr);
    }
}
{code}

We write no bytes and wait for no response, not even a trivial one. If the JVM
is stuck in a GC pause but still accepts socket connections, this check gives a
false positive. As a result, the wrong node can leave the cluster: the node
preceding the hanging one.
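
To illustrate the false positive outside of Ignite, here is a minimal, self-contained sketch. It assumes a typical OS where the kernel completes TCP handshakes for a listening socket (up to its backlog) even if the application never calls accept(). The names `ConnectOnlyCheckDemo`, `connectOnlyCheck` and `hungPeerPassesCheck` are hypothetical and not part of Ignite; the never-accepting `ServerSocket` only stands in for a JVM whose discovery threads are suspended.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class ConnectOnlyCheckDemo {
    /** Connect-only check in the style of the snippet above: true if the TCP connect succeeds. */
    static boolean connectOnlyCheck(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return true;
        }
        catch (IOException e) {
            return false;
        }
    }

    /** Runs the connect-only check against a listener whose application never calls accept(). */
    static boolean hungPeerPassesCheck() {
        // Assumption: this listener models a peer that is paused and processes nothing.
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress addr = new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort());

            return connectOnlyCheck(addr, 1000);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("connect-only check passed for a hung peer: " + hungPeerPassesCheck());
    }
}
```

On common platforms the check is expected to pass here, although nothing on the listening side ever processes the connection.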

Consider:
    1) There is a cluster with nodes 'A', 'B', 'C'.
    2) 'B' is stalled in a GC pause or is waiting for some threads to reach a
safepoint. Its discovery threads are already suspended and do not read or write
messages/responses.
    3) 'A' fails to send a message to 'B' and sees the timeout.
    4) 'A' connects to 'C' and asks it to check 'B' and, if 'C' cannot
check/ping 'B', to establish a new permanent cluster connection 'A'->'C'.
    5) 'C' pings 'B' and successfully creates a connection to it
(Socket#connect()), then closes the socket right after opening it.
    6) 'C' refuses to establish a permanent cluster connection with 'A' and
answers that 'B' is alive.
    7) 'A' tries to connect to 'B' again. It connects successfully
(Socket#connect()) but receives no answer, because the JVM of 'B' can only
accept connections while the Ignite threads that read from and write to the
socket are suspended.
    8) 'A' loops through #3 - #7 until it reaches `connectionRecoveryTimeout`.
    9) 'A' is segmented and leaves the cluster even though it is alive and able
to establish a permanent cluster connection to 'C'.

We should either make this check write something to the socket and wait for a
response, or remove it altogether.
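
A rough sketch of the first option, under stated assumptions: the probe byte, the echo reply and all names (`EchoCheckSketch`, `checkWithResponse`, `PROBE`, `probeHungPeer`, `probeLivePeer`) are hypothetical and do not reflect Ignite's discovery protocol. It only shows the idea of requiring a reply within a read timeout instead of trusting a successful connect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class EchoCheckSketch {
    /** Hypothetical one-byte probe; not a real Ignite message. */
    static final int PROBE = 0x2A;

    /** True only if the peer accepts the connection and echoes the probe byte in time. */
    static boolean checkWithResponse(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs); // Bound the wait for the reply.

            sock.getOutputStream().write(PROBE);
            sock.getOutputStream().flush();

            return sock.getInputStream().read() == PROBE;
        }
        catch (IOException e) {
            return false; // Covers connect failures and SocketTimeoutException on read.
        }
    }

    /** Probes a listener that never calls accept(), standing in for a suspended JVM. */
    static boolean probeHungPeer() {
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return checkWithResponse(new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort()), 300);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Probes a peer whose thread is running and echoes a single byte back. */
    static boolean probeLivePeer() {
        try (ServerSocket live = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(() -> {
                try (Socket s = live.accept()) {
                    int b = s.getInputStream().read();

                    s.getOutputStream().write(b);
                    s.getOutputStream().flush();
                }
                catch (IOException ignored) {
                    // Listener was closed before a connection arrived; nothing to echo.
                }
            });

            echo.setDaemon(true);
            echo.start();

            return checkWithResponse(new InetSocketAddress(live.getInetAddress(), live.getLocalPort()), 1000);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("hung peer reported alive: " + probeHungPeer());
        System.out.println("live peer reported alive: " + probeLivePeer());
    }
}
```

With such a check, a peer that accepts connections but never answers is treated as unavailable once the read timeout expires, at the cost of one extra round trip per check.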


> Insufficient backward connection check.
> ---------------------------------------
>
>                 Key: IGNITE-23118
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23118
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Priority: Major
>              Labels: ise



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
