daniverltd opened a new issue, #11271:
URL: https://github.com/apache/ignite/issues/11271

   I've documented this issue in JIRA as IGNITE-20940.
   
   SSL has to be enabled to trigger this deadlock.
   
   ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately 
reads from a socket that has no socket timeout set. If, as can happen during 
periods of network instability, one node thinks it has successfully sent a 
message to another node but the other node hasn't received it, then both 
nodes can become blocked in the same unmarshal() call, each waiting for the 
other to send something.
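   
   To illustrate the blocking read described above, here is a minimal sketch 
using plain java.net sockets. It is not Ignite code, and the class and method 
names are hypothetical. SSL is left out for brevity; SSLSocket extends Socket, 
so the same setSoTimeout() call is available there. The sketch connects to a 
local peer that never sends anything and shows that a read with a timeout set 
gives up with SocketTimeoutException instead of blocking indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo class; not part of Ignite. */
final class ReadTimeoutSketch {
    /**
     * Connects to a local peer that never sends anything and attempts one read.
     * Returns true if the read was abandoned because the socket timeout expired.
     */
    static boolean readTimesOut(int timeoutMs) {
        try (ServerSocket srv = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(srv.getInetAddress(), srv.getLocalPort());
             Socket peer = srv.accept()) {
            // With the default timeout of 0 the read below would block indefinitely.
            client.setSoTimeout(timeoutMs);

            InputStream in = client.getInputStream();
            try {
                in.read(); // The peer stays silent, so no byte ever arrives.
                return false;
            } catch (SocketTimeoutException e) {
                return true; // The caller can now close the socket and recover.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```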
   
   A handshake timeout eventually fires and attempts to close the socket to 
break the stalemate. Before the socket is closed, however, GridNioSslFilter -> 
onSessionClose() is invoked and tries to acquire the sslHandler lock, which 
is already owned by the socket read thread or another related thread; the 
result is deadlock.
   
   A separate watchdog thread spots that the system timer thread has stopped 
updating its heartbeat time value and reports "Blocked system-critical thread 
has been detected" and triggers the failure handler.
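   
   The heartbeat check described above can be sketched roughly as follows. 
This is a simplified illustration, not Ignite's actual implementation; the 
class name, method names, and threshold are all assumptions. The monitored 
worker records a timestamp on every loop iteration, and the watchdog treats 
the worker as blocked once that timestamp is older than the allowed silence.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical heartbeat tracker; names and semantics are assumptions. */
final class HeartbeatWatchdogSketch {
    private final AtomicLong lastBeatNanos = new AtomicLong();
    private final long maxSilenceNanos;

    HeartbeatWatchdogSketch(long maxSilenceMs) {
        this.maxSilenceNanos = TimeUnit.MILLISECONDS.toNanos(maxSilenceMs);
    }

    /** Called by the monitored worker on each loop iteration. */
    void beat(long nowNanos) {
        lastBeatNanos.set(nowNanos);
    }

    /** Called by the watchdog: true if the worker has been silent for too long. */
    boolean isBlocked(long nowNanos) {
        return nowNanos - lastBeatNanos.get() > maxSilenceNanos;
    }

    public static void main(String[] args) {
        HeartbeatWatchdogSketch w = new HeartbeatWatchdogSketch(100);
        w.beat(System.nanoTime());
        System.out.println("blocked right after a beat: " + w.isBlocked(System.nanoTime()));
    }
}
```

   Timestamps are passed in explicitly so the check can be exercised without 
sleeping; a thread stuck in lock() never calls beat() again, which is what the 
watchdog eventually notices.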
   
   If the failure handler is set to restart, the node restart process is 
triggered, which first attempts to cleanly close all existing connections. 
Eventually it tries to close the deadlocked connection, but before doing so 
GridNioSslFilter again attempts to acquire the sslHandler lock, deadlocking 
the restart process too.
   
   Suggested fix(es): Set a socket timeout before calling unmarshal() and/or 
add a time limit in GridNioSslFilter when waiting to acquire the sslHandler 
lock.
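   
   A minimal sketch of the second suggestion, assuming the sslHandler lock 
behaves like the ReentrantLock seen in the stack trace below. This is not the 
actual GridNioSslFilter code; the class, its methods, and the fallback policy 
are hypothetical. Instead of an unbounded lock(), the close path waits for a 
limited time and reports failure so the caller can force-close the connection.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a close handler guarded by a per-session lock. */
final class BoundedLockCloseSketch {
    private final ReentrantLock lock = new ReentrantLock();

    /**
     * Runs the graceful shutdown step under the lock, waiting at most waitMs.
     * Returns false if the lock could not be obtained in time, in which case
     * the caller should skip the graceful step and force-close the socket.
     */
    boolean closeGracefully(long waitMs, Runnable gracefulShutdown) {
        boolean locked = false;
        try {
            locked = lock.tryLock(waitMs, TimeUnit.MILLISECONDS);
            if (locked)
                gracefulShutdown.run();
            return locked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt status.
            return false;
        } finally {
            if (locked)
                lock.unlock();
        }
    }

    /** Simulates the reported scenario: another thread owns the lock during close. */
    static boolean closeWhileLockHeld(long waitMs) {
        BoundedLockCloseSketch s = new BoundedLockCloseSketch();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread owner = new Thread(() -> {
            s.lock.lock();
            try {
                held.countDown();
                release.await(); // Keep the lock until the close attempt is over.
            } catch (InterruptedException ignored) {
                // Fall through and release the lock.
            } finally {
                s.lock.unlock();
            }
        });
        owner.start();
        try {
            held.await();
            return s.closeGracefully(waitMs, () -> { });
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("graceful close while lock is held: " + closeWhileLockHeld(200));
    }
}
```

   Whether skipping the graceful SSL shutdown is acceptable when the wait 
expires is a design decision for the maintainers; the sketch only shows that 
the closing thread is no longer parked forever.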
   
    
   
   Stack traces of relevant threads:
   
   Thread [name="tcp-disco-sock-reader-[3cff52b3 IP:32602 client]#285#531", 
id=569, state=RUNNABLE, blockCnt=4, waitCnt=0]
           at java.net.SocketInputStream.socketRead0(Native Method)
           at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
           at java.net.SocketInputStream.read(SocketInputStream.java:171)
           at java.net.SocketInputStream.read(SocketInputStream.java:141)
           at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:475)
           at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:469)
           at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:69)
           at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1266)
           at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:76)
           at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:943)
           at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
           at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
           at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
           - locked java.io.BufferedInputStream@20b3a454
           at o.a.i.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
           at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2837)
           at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2853)
           at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3330)
           at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:939)
           at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
           at o.a.i.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:43)
           at o.a.i.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:122)
           at o.a.i.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
           at o.a.i.i.util.IgniteUtils.unmarshal(IgniteUtils.java:10709)
           at o.a.i.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7020)
           at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
   
    
   
   Thread [name="grid-nio-worker-client-listener-1-#33", id=53, state=RUNNABLE, 
blockCnt=383, waitCnt=1]
           at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:418)
           at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:397)
           - locked sun.security.ssl.SSLEngineImpl@2f9b9b2e
           at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:626)
           at o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrap0(GridNioSslHandler.java:610)
           at o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrapData(GridNioSslHandler.java:518)
           at o.a.i.i.util.nio.ssl.GridNioSslHandler.messageReceived(GridNioSslHandler.java:336)
           at o.a.i.i.util.nio.ssl.GridNioSslFilter.onMessageReceived(GridNioSslFilter.java:397)
           at o.a.i.i.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
           at o.a.i.i.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3752)
           at o.a.i.i.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:175)
           at o.a.i.i.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1379)
           at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2526)
           at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2281)
           at o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
           at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
           at java.lang.Thread.run(Thread.java:750)
   
   The blocked system timer thread:
   
   Thread [name="grid-timeout-worker-#22", id=40, state=WAITING, blockCnt=4, 
waitCnt=622037]
       Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync@3ccdf067, ownerName=grid-nio-worker-client-listener-1-#33, ownerId=53]
           at sun.misc.Unsafe.park(Native Method)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
           at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
           at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
           at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
           at o.a.i.i.util.nio.ssl.GridNioSslFilter.onSessionClose(GridNioSslFilter.java:431)
           at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
           at o.a.i.i.util.nio.GridNioCodecFilter.onSessionClose(GridNioCodecFilter.java:137)
           at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
           at o.a.i.i.util.nio.GridNioAsyncNotifyFilter.onSessionClose(GridNioAsyncNotifyFilter.java:124)
           at o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
           at o.a.i.i.util.nio.GridNioFilterChain$TailFilter.onSessionClose(GridNioFilterChain.java:274)
           at o.a.i.i.util.nio.GridNioFilterChain.onSessionClose(GridNioFilterChain.java:203)
           at o.a.i.i.util.nio.GridNioSessionImpl.close(GridNioSessionImpl.java:169)
           at o.a.i.i.util.nio.GridSelectorNioSessionImpl.close(GridSelectorNioSessionImpl.java:498)
           at o.a.i.i.processors.odbc.ClientListenerNioListener$1.run(ClientListenerNioListener.java:264)
           at o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:365)
           - locked o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask@a2e6d09
           at o.a.i.i.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
           at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
           at java.lang.Thread.run(Thread.java:750)

