Hi Ethan,

That sounds like a reasonable hypothesis (with the caveat that I'm far from
an expert on gRPC Netty server best practices). Did you observe any errors
or warnings in the server-side logs while this was occurring? By default, a
gRPC Netty server uses a cached thread pool for application processing - the
general recommendation (e.g., here
<https://stackoverflow.com/questions/42408634/what-is-the-exact-use-of-the-executor-in-grpc-java-s-serverbuilder-does-it-just>
and here
<https://stackoverflow.com/questions/47505022/does-grpc-server-spin-up-a-new-thread-for-each-request>)
is to specify a fixed thread pool instead via `ServerBuilder.executor`. It
seems possible that, if the cached thread pool running your server
application code has spun up lots of long-running threads, it could starve
the Netty event loops (see
https://groups.google.com/g/grpc-io/c/LrnAbWFozb0) that handle incoming
connections, leaving them unable to run. So I would recommend trying a fixed
thread pool executor, capped at a number of threads small enough that the
CPU still has room to run the event loop threads - something like the sketch
below.
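
To make that concrete, here's a rough, untested sketch of what I mean
(`MyServiceImpl`, the port, and the pool size are placeholders you'd swap
for your own):

```java
import io.grpc.Server;
import io.grpc.ServerBuilder;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FixedPoolServer {
  public static void main(String[] args) throws Exception {
    // Bound the application executor instead of relying on the default
    // cached (unbounded) thread pool, so long-running handlers can't spin
    // up enough threads to starve the Netty event loops.
    ExecutorService appExecutor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors() * 2); // placeholder sizing

    Server server = ServerBuilder.forPort(8443)  // placeholder port
        .executor(appExecutor)                   // application callbacks run here
        .addService(new MyServiceImpl())         // placeholder service impl
        .build()
        .start();

    server.awaitTermination();
  }
}
```

The right pool size depends on how long your handlers typically run; the
important part is just that it's bounded.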

Thanks,

Eric

On Thu, Jul 30, 2020 at 1:34 PM Ethan Cheng <[email protected]>
wrote:

> Hi Eric, thanks for looking into this.
> I agree that this has nothing to do with `MAX_CONCURRENT_STREAM`, and I can
> see why you are thinking about blocking in Netty's event loop.
> Since a lot of our requests are long-running streaming requests that
> transmit a lot of data, is it possible that they are using most of the
> threads in the event loop, leading to a kind of `thread starvation` for the
> connections coming from the troublesome clients? Those connections would
> then sit idle past `DEFAULT_SERVER_KEEPALIVE_TIMEOUT_NANOS`, and so they
> became inactive?
> If that is the situation, should we customize the thread pool in Netty?
>
> On Wednesday, July 29, 2020 at 10:25:28 PM UTC-7 [email protected]
> wrote:
>
>> The first stack trace's cause is truncated, so I'm not sure what happened
>> there, but the second includes
>> `io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1792)`
>> after
>> `io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1073)`,
>> which I believe indicates that the channel - for an unknown (possibly
>> truncated in the stack trace?) reason - went inactive during the TLS
>> handshake. As for why this would happen under load: just taking a guess,
>> but is it possible your server is performing a blocking operation in the
>> Netty event loop? I don't think a MAX_CONCURRENT_STREAMS-related capacity
>> issue would result in this stack trace, as the connection should be more
>> or less cleanly rejected, rather than a connected channel going inactive
>> mid-handshake as the provided stack trace suggests.
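>>
>> To illustrate the kind of blocking I mean, here's a made-up sketch (not
>> your code - `MyStreamingServiceImpl` and the port are placeholders): if
>> the server were built with `directExecutor()`, service handlers would run
>> directly on the Netty event loop threads, so any blocking call in a
>> handler would also stall connection accepts and TLS handshakes on that
>> loop.
>>
>> ```java
>> // Hypothetical anti-pattern, for illustration only (names are placeholders).
>> import io.grpc.Server;
>> import io.grpc.ServerBuilder;
>>
>> public class DirectExecutorServer {
>>   public static void main(String[] args) throws Exception {
>>     Server server = ServerBuilder.forPort(8443)    // placeholder port
>>         .directExecutor()                          // handlers run on the Netty event loop threads
>>         .addService(new MyStreamingServiceImpl())  // placeholder service impl
>>         .build()
>>         .start();
>>     // With directExecutor(), any blocking call inside a handler (slow reads,
>>     // large synchronous writes, etc.) also blocks that event loop, so new
>>     // connections and TLS handshakes on the same loop can't make progress.
>>     server.awaitTermination();
>>   }
>> }
>> ```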
>>
>> Thanks,
>>
>> Eric
>>
>> On Wed, Jul 29, 2020 at 4:57 PM Ethan Cheng <[email protected]>
>> wrote:
>>
>>> We had an incident earlier where one of our clients had an improper
>>> retry policy that led to a huge wave of requests to the server. Then all
>>> clients started to see this `StatusRuntimeException: UNKNOWN: channel
>>> closed` on the client side.
>>>
>>> For one client, it was:
>>> ```
>>> @400000005f0744ef2096858c java.util.concurrent.CompletionException:
>>> io.grpc.StatusRuntimeException: UNKNOWN: channel closed
>>> @400000005f0744ef2096858c Channel Pipeline: [SslHandler#0,
>>> ProtocolNegotiators$ClientTlsHandler#0,
>>> WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
>>> @400000005f0744ef20968974 at
>>> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
>>> ~[?:?]
>>> @400000005f0744ef2096df64 at
>>> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
>>> ~[?:?]
>>> @400000005f0744ef2096df64 at
>>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
>>> ~[?:?]
>>> @400000005f0744ef2096e34c at
>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>>> [?:?]
>>> @400000005f0744ef2096e34c at
>>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>>> [?:?]
>>> @400000005f0744ef2096ef04 at
>>> io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
>>> [grpc-stub-1.27.1.jar:1.27.1]
>>> @400000005f0744ef2096fabc at
>>> io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef2096fea4 at
>>> io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef2096fea4 at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef20970674 at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef20970a5c at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef20970a5c at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef2097604c at
>>> io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef20976434 at
>>> io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>>> [grpc-core-1.29.0.jar:1.29.0]
>>> @400000005f0744ef20976434 at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> [?:?]
>>> @400000005f0744ef20976fec at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> [?:?]
>>> @400000005f0744ef20976fec at java.lang.Thread.run(Thread.java:834) [?:?]
>>> @400000005f0744ef20976fec Caused by: io.grpc.StatusRuntimeException:
>>> UNKNOWN: channel closed
>>> @400000005f0744ef209773d4 Channel Pipeline: [SslHandler#0,
>>> ProtocolNegotiators$ClientTlsHandler#0,
>>> WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
>>> @400000005f0744ef209777bc at
>>> io.grpc.Status.asRuntimeException(Status.java:533)
>>> ~[grpc-api-1.29.0.jar:1.29.0]
>>> @400000005f0744ef209777bc ... 12 more
>>> ```
>>>
>>> For another client, it was:
>>> ```
>>> Caused by: java.nio.channels.ClosedChannelException
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.Utils.statusFromThrowable(Utils.java:168)
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.NettyClientTransport$5.operationComplete(NettyClientTransport.java:267)
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.NettyClientTransport$5.operationComplete(NettyClientTransport.java:261)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:112)
>>>
>>> java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException:
>>> UNKNOWN: channel closed
>>> at
>>> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>>> ~[?:1.8.0_92]
>>> at
>>> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~
>>> Caused by: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
>>> at io.grpc.Status.asRuntimeException(Status.java:533)
>>> at
>>> io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442)
>>> io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
>>> io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
>>> ~[sync-bid-strategy.jar:?]
>>> at
>>> io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
>>> ~[sync-bid-strategy.jar:?]
>>> at
>>> io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
>>> ~[sync-bid-strategy.jar:?]
>>> at
>>> io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
>>> at
>>> io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
>>> at
>>> io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
>>> at
>>> io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
>>> at
>>> io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507)
>>> at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
>>> at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627)
>>> at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515)
>>> at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686)
>>> at
>>> io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675)
>>> at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>>> at
>>> io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> ~[?:1.8.0_92]
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> ~[?:1.8.0_92]
>>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_92]
>>> Caused by: java.nio.channels.ClosedChannelException
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.Utils.statusFromThrowable(Utils.java:202)
>>> ~
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.WriteBufferingAndExceptionHandler.exceptionCaught(WriteBufferingAndExceptionHandler.java:86)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268)
>>> at
>>> io.grpc.netty.shaded.io.grpc.netty.ProtocolNegotiators$ClientTlsHandler.userEventTriggered(ProtocolNegotiators.java:373)
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319)
>>> at
>>> io.grpc.netty.shaded.io.netty.handler.ssl.SslUtils.handleHandshakeFailure(SslUtils.java:348)
>>> at
>>> io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1792)
>>> at
>>> io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1073)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:827)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
>>> at
>>> io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:338)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> at
>>> io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_92]
>>> Caused by: java.nio.channels.ClosedChannelException
>>> at
>>> io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.channelInactive(...)(Unknown
>>> Source)
>>> ```
>>>
>>> The load is probably not very high (~40 RPS), but most of these requests
>>> are server streaming and transmit a fairly large amount of data, meaning
>>> each request takes a considerable amount of time to complete.
>>> Are we hitting some kind of limit on our server? I thought it could be
>>> `MAX_CONCURRENT_STREAM`, but I'm not sure whether `MAX_CONCURRENT_STREAM`
>>> applies to a single client's connection to the server or to the server's
>>> connections overall.
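>>>
>>> (If it is per connection, I assume the knob on our side would be
>>> something like the hypothetical snippet below - we haven't set it
>>> explicitly, and the port/service names are placeholders:)
>>>
>>> ```java
>>> // Hypothetical sketch only, not our actual server setup.
>>> import io.grpc.Server;
>>> import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;
>>>
>>> public class StreamCappedServer {
>>>   public static void main(String[] args) throws Exception {
>>>     Server server = NettyServerBuilder.forPort(8443)
>>>         // Caps concurrent calls on each client connection (surfaced to the
>>>         // client via HTTP/2 MAX_CONCURRENT_STREAMS, as I understand it).
>>>         .maxConcurrentCallsPerConnection(100)
>>>         .addService(new MyStreamingServiceImpl()) // placeholder service impl
>>>         .build()
>>>         .start();
>>>     server.awaitTermination();
>>>   }
>>> }
>>> ```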
>>>
>>> We are using grpc-netty-shaded (1.27.1).
>>>
>>>
