I have a theory that explains what happened.
This issue is caused by https://github.com/grpc/grpc-java/issues/3545.
In grpc 1.5 there is a race condition: when the channel goes idle, the 
load balancer is shut down and its subchannels are scheduled to shut down 
5 seconds later. If an RPC arrives and succeeds within those 5 seconds, a 
new load balancer and subchannel are created.
However, when the old subchannel finishes shutting down, the old load 
balancer is notified of the state change and updates the channel picker 
with an empty list of subchannels.
Any RPC made before the channel goes idle again is then buffered in the 
delayed transport forever.
The following code reproduces the scenario:


public static void main(String[] args) throws Exception {
  ServerBuilder.forPort(12345)
      .addService(new GreeterImpl().bindService())
      .build()
      .start();
  Channel channel = NettyChannelBuilder.forTarget("localhost:12345")
      .idleTimeout(1, TimeUnit.SECONDS)
      .negotiationType(NegotiationType.PLAINTEXT)
      .loadBalancerFactory(RoundRobinLoadBalancerFactory.getInstance())
      .usePlaintext(true)
      .build();
  GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
  stub.sayHello(HelloRequest.getDefaultInstance());
  // The idle-mode timer runs after 1 sec, so the subchannel is scheduled
  // to shut down 5 sec later, i.e. ~6 sec after the last RPC.
  Thread.sleep(5500);
  // The connection is re-established and the RPC succeeds.
  stub.sayHello(HelloRequest.getDefaultInstance());
  // Wait for the old subchannel to shut down; the bad (empty) picker gets set.
  Thread.sleep(600);
  // Another RPC made before the channel goes idle again never returns.
  stub.sayHello(HelloRequest.getDefaultInstance());
}


The bug is fixed in 1.7, because the following change prevents the channel 
picker from being updated by a load balancer after that balancer has been 
shut down:
https://github.com/grpc/grpc-java/pull/3300/files#diff-354509714620f5c493a6810aab9419f2R684
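
Conceptually, the guard works like the sketch below. The names here are 
hypothetical, not grpc-java's actual internals: picker updates carry the 
balancer that produced them, and the channel drops updates from a balancer 
it has already replaced.

// Hypothetical sketch of the stale-picker guard added in 1.7.
final class ChannelPickerState {
  interface Picker {}  // stands in for grpc's SubchannelPicker

  private Object currentBalancer;  // swapped when the channel exits idle
  private volatile Picker picker;

  // Called when the channel exits idle mode and creates a new balancer.
  synchronized void setCurrentBalancer(Object balancer) {
    currentBalancer = balancer;
  }

  // Updates from a balancer that is no longer current (e.g. the old one
  // reporting an empty subchannel list while shutting down) are ignored,
  // so they cannot clobber the new balancer's picker.
  synchronized void updatePicker(Object fromBalancer, Picker newPicker) {
    if (fromBalancer != currentBalancer) {
      return;
    }
    picker = newPicker;
  }
}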


On Wednesday, November 1, 2017 at 2:08:25 PM UTC-7, Carl Mastrangelo wrote:
>
> All the methods should be overridden.  The idle connection time is likely 
> that timeout.  
>
> On Monday, October 30, 2017 at 2:00:59 PM UTC-7, [email protected] wrote:
>>
>> Some findings to share, not sure if they are related though:
>> 1. We use a custom gRPC name resolver, and its refresh() method is not 
>> overridden (the default is a no-op).
>> 2. In past occurrences, before DEADLINE_EXCEEDED started happening, 
>> there was ALWAYS a 30-minute gap between the last and the second-to-last 
>> successful gRPC call from the broken client.
>>
>> 30 minutes seems to be the default connection idle timeout, which is 
>> what we use. Not sure if that is related though.
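
For reference, a custom resolver can override refresh() so that the 
channel's re-resolution requests actually do something. A minimal sketch, 
assuming the NameResolver API as of grpc-java 1.6+ (Listener#onAddresses) 
and a hypothetical lookUpServers() helper:

import io.grpc.Attributes;
import io.grpc.EquivalentAddressGroup;
import io.grpc.NameResolver;
import java.util.List;

final class MyNameResolver extends NameResolver {
  private Listener listener;

  @Override
  public String getServiceAuthority() {
    return "my-service";  // hypothetical authority
  }

  @Override
  public void start(Listener listener) {
    this.listener = listener;
    resolve();
  }

  // Without this override, the channel's request to re-resolve (e.g.
  // after a connection failure) is silently dropped.
  @Override
  public void refresh() {
    resolve();
  }

  private void resolve() {
    List<EquivalentAddressGroup> servers = lookUpServers();
    listener.onAddresses(servers, Attributes.EMPTY);
  }

  // Placeholder for whatever lookup the real resolver performs
  // (DNS, a registry, a static list, ...).
  private List<EquivalentAddressGroup> lookUpServers() {
    throw new UnsupportedOperationException("illustrative only");
  }

  @Override
  public void shutdown() {}
}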
>>
>> On Thursday, October 19, 2017 at 11:09:26 AM UTC-7, Bi Ran wrote:
>>>
>>> The client automatically recovered after a few hours.
>>> This is the only thread that is relevant.
>>> I did not explicitly set waitForReady, and I think it is disabled by 
>>> default.
>>> I can provide more thread dumps when it happens again. Let me know if 
>>> any information other than thread dumps is needed.
>>>
>>>
>>> On Thu, Oct 19, 2017 at 10:31 AM 'Carl Mastrangelo' via grpc.io <
>>> [email protected]> wrote:
>>>
>>>> Is that the only thread?  Also, are you using waitForReady?
>>>>
>>>>
>>>> On Wednesday, October 18, 2017 at 3:40:19 PM UTC-7, [email protected] 
>>>> wrote:
>>>>>
>>>>> The problem happened to me again, even with keep-alive enabled.
>>>>> netstat suggests that the underlying TCP connection is established.
>>>>> Client thread dump follows:
>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <0x00007f1624000038> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>>>>>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>>>>         at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:572)
>>>>>         at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:120)
>>>>>
>>>>> All server threads are idle. It is likely the server receives no requests.
>>>>>
>>>>>
>>>>> On Wednesday, October 11, 2017 at 7:57:28 AM UTC-7, Taehyun Park wrote:
>>>>>>
>>>>>> This is what I did to avoid this problem in production: I wrapped all 
>>>>>> gRPC calls with RxJava and used retry to re-initialize the channel 
>>>>>> when DEADLINE_EXCEEDED is thrown.
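
A minimal sketch of that workaround, assuming RxJava 2 and the Greeter 
stubs from the repro above; channelRef and rebuildChannel() are 
hypothetical helpers, not part of grpc or RxJava:

import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.reactivex.Single;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Wrap the blocking call in a Single; on DEADLINE_EXCEEDED, swap in a
// fresh channel and retry; any other failure propagates to the caller.
Single<HelloReply> sayHelloWithRecovery(AtomicReference<ManagedChannel> channelRef) {
  return Single
      .fromCallable(() -> GreeterGrpc.newBlockingStub(channelRef.get())
          .withDeadlineAfter(5, TimeUnit.SECONDS)
          .sayHello(HelloRequest.getDefaultInstance()))
      .retry(t -> {
        if (Status.fromThrowable(t).getCode() == Status.Code.DEADLINE_EXCEEDED) {
          // Drop the (possibly wedged) channel and build a new one.
          channelRef.getAndSet(rebuildChannel()).shutdownNow();
          return true;
        }
        return false;
      });
}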
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tuesday, September 26, 2017 at 7:19:14 AM UTC+9, [email protected] 
>>>>>> wrote:
>>>>>>>
>>>>>>> Language: java
>>>>>>> Version: 1.5
>>>>>>>
>>>>>>> I ran into a weird issue multiple times recently: all RPCs from one 
>>>>>>> client failed with DEADLINE_EXCEEDED. From the server log, it looks 
>>>>>>> like these failed requests never arrived at the server at all. Other 
>>>>>>> gRPC clients worked fine during that time. The issue was fixed by 
>>>>>>> restarting the client application.
>>>>>>> The keep-alive feature is not used in the client. From my 
>>>>>>> understanding, the client channel should manage the underlying 
>>>>>>> connection properly even when keep-alive is off.
>>>>>>> This issue happens occasionally, and I haven't found a reliable way 
>>>>>>> to reproduce it.
>>>>>>>
