daniverltd opened a new issue, #11271:
URL: https://github.com/apache/ignite/issues/11271
I've documented this issue in JIRA as IGNITE-20940.
SSL has to be enabled to trigger this deadlock.
ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately
attempts to read from a socket that has no socket timeout set. If, as can
happen during periods of network instability, one node thinks it has
successfully sent a message to another node but the other node hasn't received
the message, then both nodes can become blocked in the same unmarshal() call,
each waiting for the other to send something.
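To illustrate the blocking read described above, here is a minimal, self-contained sketch using only plain java.net sockets. It is not Ignite code: the class name ReadTimeoutSketch and the method readTimesOut are hypothetical, and the silent loopback peer only stands in for a remote node that never sends the expected message. With SO_TIMEOUT set, the read gives up with a SocketTimeoutException instead of waiting indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    /**
     * Connects to a local peer that never sends anything and tries to read one byte.
     * Returns true if the read was abandoned because SO_TIMEOUT expired.
     * (Hypothetical helper for illustration; not part of Ignite.)
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();

        try (ServerSocket srv = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, srv.getLocalPort());
             Socket silentPeer = srv.accept()) {
            // Without this call the read below would block until the peer sends or closes.
            client.setSoTimeout(timeoutMs);

            InputStream in = client.getInputStream();

            try {
                in.read();

                return false; // Data or end-of-stream arrived; not expected in this sketch.
            }
            catch (SocketTimeoutException e) {
                return true; // The read was interrupted by the timeout; the socket is still usable.
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

A caller that gets the timeout can then decide whether to retry or close the connection, rather than staying parked inside the read.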
A handshake timeout eventually triggers and attempts to close the socket to
break the stalemate. Before the socket is closed, however, GridNioSslFilter ->
onSessionClose() is invoked and tries to acquire the sslHandler lock, which is
already owned by the socket read thread or another related thread; the result
is a deadlock.
A separate watchdog thread spots that the system timer thread has stopped
updating its heartbeat time value and reports "Blocked system-critical thread
has been detected" and triggers the failure handler.
If the failure handler is set to restart, the node restart process is
triggered, which first attempts to cleanly close all existing connections.
Eventually it tries to close the deadlocked connection, but before doing so
GridNioSslFilter again attempts to acquire the sslHandler lock, deadlocking
the restart process too.
Suggested fix(es): Add a socket timeout before calling unmarshal() and/or add
a time limit in GridNioSslFilter when waiting to acquire the sslHandler lock.
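The second suggestion (a time limit on lock acquisition) could look roughly like the sketch below. It uses only java.util.concurrent and is not the GridNioSslFilter implementation: the names TimedLockSketch, closeWithTimedLock and closeWhileLockHeld are hypothetical, and the sketch does not address whether it is safe for Ignite to proceed with a close when the sslHandler lock could not be acquired.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    /**
     * Runs closeAction under the lock if the lock is acquired within timeoutMs.
     * Returns false, without running the action, if the lock could not be acquired in time.
     */
    static boolean closeWithTimedLock(ReentrantLock lock, long timeoutMs, Runnable closeAction)
        throws InterruptedException {
        // tryLock with a timeout returns instead of parking forever, unlike lock().
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS))
            return false;

        try {
            closeAction.run();

            return true;
        }
        finally {
            lock.unlock();
        }
    }

    /** Simulates a stuck reader holding the lock while the calling thread attempts a close. */
    static boolean closeWhileLockHeld(long timeoutMs) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lock();

            try {
                held.countDown();
                release.await(); // Stands in for a read that never completes.
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            finally {
                lock.unlock();
            }
        });

        holder.start();
        held.await();

        try {
            return closeWithTimedLock(lock, timeoutMs, () -> { });
        }
        finally {
            release.countDown();
            holder.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("close ran while lock was held: " + closeWhileLockHeld(200));
    }
}
```

In this sketch the closing thread gets control back after the timeout, so a timer or restart thread would not park indefinitely; what it should do next (log, force-close the channel, retry) is a design decision left open here.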
Stack traces of relevant threads:
Thread [name="tcp-disco-sock-reader-[3cff52b3 IP:32602 client]#285#531", id=569, state=RUNNABLE, blockCnt=4, waitCnt=0]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:475)
at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:469)
at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:69)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1266)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:76)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:943)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked java.io.BufferedInputStream@20b3a454
at o.a.i.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2837)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2853)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3330)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:939)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
at o.a.i.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:43)
at o.a.i.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:122)
at o.a.i.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
at o.a.i.i.util.IgniteUtils.unmarshal(IgniteUtils.java:10709)
at o.a.i.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7020)
at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
Thread [name="grid-nio-worker-client-listener-1-#33", id=53, state=RUNNABLE, blockCnt=383, waitCnt=1]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:418)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:397)
- locked sun.security.ssl.SSLEngineImpl@2f9b9b2e
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:626)
at o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrap0(GridNioSslHandler.java:610)
at o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrapData(GridNioSslHandler.java:518)
at o.a.i.i.util.nio.ssl.GridNioSslHandler.messageReceived(GridNioSslHandler.java:336)
at o.a.i.i.util.nio.ssl.GridNioSslFilter.onMessageReceived(GridNioSslFilter.java:397)
at o.a.i.i.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
at o.a.i.i.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3752)
at o.a.i.i.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:175)
at o.a.i.i.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1379)
at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2526)
at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2281)
at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
at java.lang.Thread.run(Thread.java:750)
The blocked system timer thread:
Thread [name="grid-timeout-worker-#22", id=40, state=WAITING, blockCnt=4, waitCnt=622037]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync@3ccdf067, ownerName=grid-nio-worker-client-listener-1-#33, ownerId=53]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at o.a.i.i.util.nio.ssl.GridNioSslFilter.onSessionClose(GridNioSslFilter.java:431)
at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
at o.a.i.i.util.nio.GridNioCodecFilter.onSessionClose(GridNioCodecFilter.java:137)
at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
at o.a.i.i.util.nio.GridNioAsyncNotifyFilter.onSessionClose(GridNioAsyncNotifyFilter.java:124)
at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
at o.a.i.i.util.nio.GridNioFilterChain$TailFilter.onSessionClose(GridNioFilterChain.java:274)
at o.a.i.i.util.nio.GridNioFilterChain.onSessionClose(GridNioFilterChain.java:203)
at o.a.i.i.util.nio.GridNioSessionImpl.close(GridNioSessionImpl.java:169)
at o.a.i.i.util.nio.GridSelectorNioSessionImpl.close(GridSelectorNioSessionImpl.java:498)
at o.a.i.i.processors.odbc.ClientListenerNioListener$1.run(ClientListenerNioListener.java:264)
at o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:365)
- locked o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask@a2e6d09
at o.a.i.i.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
at java.lang.Thread.run(Thread.java:750)