Hi,

              We’re leveraging netty (v4.1.92) via Vertx (v4.3.8) 
configured to use OpenJdkSSLEngine with a deletegatedTaskExecutor() when 
creating the ssl_handler for our webserver.  In general, we see SSL 
operations (e.g., certificate validation) occurring on a worker thread, but 
we have run into a deadlock due to our current implementation and wanted 
some feedback on options.

 

The crux of the issue is that during our certificate validation via our 
custom TrustManager, we need to communicate with another internal component 
(which is running on a typical event loop thread) to complete the 
validation processing, but other SSL operations block the event loop on 
which that internal service is running by grabbing a lock.  Its my 
understanding that not all blocking SSL operations are offloaded, so there 
are likely more cases like these floating around.

 

What we are specifically seeing is that while we’re waiting on a response 
from the internal component to validate the certificate, a simultaneous 
request to close a TLS connection comes in.  The closing of the session is 
running on the event loop and blocks by waiting on a lock, but 
unfortunately, that lock is held by the worker thread waiting on a response 
from the internal component.  IOW, the internal service cannot respond to 
the worker thread because it its event loop is blocked by the close 
connection processing.  At this point, our entire webserver is dead in the 
water until we restart the JVM.

 

Sample traces are listed below from this example:

   - Close connection processing is running on event loop thread 0 and is 
   blocked trying to grab a lock that is held by a worker thread 
   - Cert validation in the TrustManager is running on a worker thread and 
   is awaiting a response from the blocked service running on event loop 
   thread 0 
   - Internal service to complete the cert validation is affined to event 
   loop thread 0, but its blocked because of the close connection processing 

 

We’ve looked at several options:

   - Detect the deadlock and fail the client authentication.  The issue 
   here is bad user experience so we’d like to avoid it. 
   - Segregate the our TLS handling in the webserver and the blocking SSL 
   handling into two DIFFERENT worker threads so that that we never block an 
   event loop.  This is a risker change for us, though. 
   - Remove the need to communicate with the internal service running on an 
   event loop thread. 

 

Any thoughts on other/better ways to handle this?

 

Thanks in advance,

Chris

 

# close connection handler running on event loop thread 0 is blocked

Sep 15 10:13:43.470794 APM00203802612-A control-path[37578]: 
"vert.x-eventloop-thread-0" #25 prio=5 os_prio=0 cpu=595313.19ms 
elapsed=28293.78s tid=0x00007f5bb8021800 nid=0x22c waiting for monitor 
entry  [0x00007f5c0034d000]

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:   
 java.lang.Thread.State: BLOCKED (on object monitor)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
sun.security.ssl.SSLEngineImpl.closeOutbound(
[email protected]/SSLEngineImpl.java:749)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         - 
waiting to lock <0x00000000dca1d0f8> (a sun.security.ssl.SSLEngineImpl)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1874)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1078)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

Sep 15 10:13:43.471494 APM00203802612-A control-path[37578]:         at 
java.lang.Thread.run([email protected]/Thread.java:829)

 

 

# Trust manager waiting on a response from our internal component which is 
running on event loop thread 0.  Note that the lock that the close 
connection handler is waiting on is held by this thread.
Sep 15 10:13:43.479257 APM00203802612-A control-path[37578]: 
"vert.x-internal-blocking-1" #42 prio=5 os_prio=0 cpu=23498.06ms 
elapsed=28290.74s tid=0x00007f5bb02b1000 nid=0x23b waiting on condition 
 [0x00007f5bc4e59000]Sep 15 10:13:43.479257 APM00203802612-A 
control-path[37578]:    java.lang.Thread.State: WAITING (parking)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         at 
jdk.internal.misc.Unsafe.park([email protected]/Native Method)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         - parking to 
wait for  <0x00000000dca24ec0> (a java.util.concurrent.CountDownLatch$Sync)Sep 
15 10:13:43.479257 APM00203802612-A control-path[37578]:         at 
java.util.concurrent.locks.LockSupport.park(
[email protected]/LockSupport.java:194)Sep 15 10:13:43.479257 
APM00203802612-A control-path[37578]:         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(
[email protected]/AbstractQueuedSynchronizer.java:885)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(
[email protected]/AbstractQueuedSynchronizer.java:1039)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
[email protected]/AbstractQueuedSynchronizer.java:1345)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         at 
java.util.concurrent.CountDownLatch.await(
[email protected]/CountDownLatch.java:232)Sep 15 10:13:43.479257 
APM00203802612-A control-path[37578]:         at 
io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:86)Sep
 
15 10:13:43.479257 APM00203802612-A control-path[37578]:         at 
io.reactivex.rxjava3.core.Single.blockingGet(Single.java:3645)Sep 15 
10:13:43.479257 APM00203802612-A control-path[37578]:         at 
com.emc.cyclone.contexts.security.credstore.CycTrustOptions$CycTrustManager.validateCertificateWithCmfa(CycTrustOptions.java:442)
 

 

-- 
You received this message because you are subscribed to the Google Groups 
"Netty discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/netty/8f31febf-5aaa-47b5-8a37-b71d63edf6acn%40googlegroups.com.

Reply via email to