Thanks for the tip!  We didn't find anything particularly insightful in 
the logs, but we did find an issue with a load balancer timeout. We've 
tweaked it and the failures have become less frequent.

On Wednesday, August 25, 2021 at 2:15:29 PM UTC-4 Lidi Zheng wrote:

> Hi,
>
> gRPC's trace flags might help: 
> https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
>
> I would recommend "GRPC_VERBOSITY=debug GRPC_TRACE=api" as a starting 
> point.
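>
> If changing the pod spec is awkward, setting them at the very top of the 
> Python entrypoint should also work, since the flags are read from the 
> process environment when the library initializes. A minimal sketch (the 
> shell/pod-spec route is the usual one):
>
>   import os
>
>   # Must be in the environment before gRPC initializes, so set them
>   # (or export them in the deployment spec) before importing grpc.
>   os.environ.setdefault("GRPC_VERBOSITY", "debug")
>   os.environ.setdefault("GRPC_TRACE", "api")
>
>   import grpc  # imported after the env setup on purpose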
>
> The RPC latency could be the result of many factors, e.g., the pod's 
> resource limits or the network environment. With the tracing flag enabled, 
> we might be able to see why a particular RPC failed, for example whether it 
> was name resolution or a failure to connect to the endpoints.
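>
> Independent of the trace output, the status attached to a failing call 
> usually already hints at which stage broke. Your client is grpc-js, where 
> the error object carries the same information in err.code and err.details; 
> in Python terms, a rough sketch of what to log (the stub and target below 
> are placeholders for your generated code and address):
>
>   import grpc
>
>   import my_service_pb2
>   import my_service_pb2_grpc  # hypothetical protoc-generated modules
>
>   channel = grpc.insecure_channel("my-server:50051")  # placeholder
>   stub = my_service_pb2_grpc.MyServiceStub(channel)
>
>   try:
>       stub.Run(my_service_pb2.RunRequest(), timeout=5.0)
>   except grpc.RpcError as e:
>       # UNAVAILABLE mentioning DNS resolution points at name resolution,
>       # UNAVAILABLE with "failed to connect to all addresses" points at
>       # connectivity, and DEADLINE_EXCEEDED with no Envoy/server log
>       # suggests the request never got past the client or the proxy.
>       print(e.code(), e.details())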
>
> If nothing looks wrong in the trace log, we could use "netstat" to check 
> for packet loss, or "tcpdump" to see where the traffic is actually going.
>
> On Tuesday, August 24, 2021 at 3:28:31 PM UTC-7 Aaron Pelz wrote:
>
>> Hi all, 
>>
>> New to the group and a little newer to gRPC in general, so hoping to get 
>> some pointers on where to dig deeper.
>>
>> Currently we have two services that interact via gRPC. Their client and 
>> server libraries are generated by the protoc compiler. Our server side runs 
>> Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both 
>> sides run on Kubernetes in the cloud. We chose gRPC for its typed 
>> contracts, polyglot server/client code generation, speed, and, most 
>> importantly, its HTTP/2 support for long-running, stateful, lightweight 
>> connections.
>>
>> Our setup uses Contour/Envoy load balancing.
>>
>> The issue right now is that we have requests that
>>   1) sometimes never even make it to the server side (i.e. they hit their 
>> client-specified timeout with DEADLINE_EXCEEDED)
>>   2) take a variable amount of time to get ack'd by the server side (this 
>> may not be an issue in itself, but it may be telling about what is 
>> happening; see the interceptor sketch after this list)
>>     - The spread is quite wide and varies per gRPC method:
>>       - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
>>       - Method 2 (run)        - p75 8ms,  p95 13ms,  p99 35ms,  max 168ms
>>       - Method 3 (abandon)    - p75 7ms,  p95 34ms,  p99 69ms,  max 285ms
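>>
>> To put numbers on "ack'd by the server side", the kind of thing we could 
>> add is a small server-side interceptor that just timestamps when each RPC 
>> reaches the Python process, so proxy/network latency can be separated from 
>> server processing. A rough sketch, not our actual code:
>>
>>   import logging
>>   import time
>>   from concurrent import futures
>>
>>   import grpc
>>
>>   class ArrivalLoggingInterceptor(grpc.ServerInterceptor):
>>       """Logs when each RPC actually reaches the Python server."""
>>
>>       def intercept_service(self, continuation, handler_call_details):
>>           # handler_call_details.method is e.g. "/pkg.Service/run"
>>           logging.info("%s reached server at %.6f",
>>                        handler_call_details.method, time.time())
>>           return continuation(handler_call_details)
>>
>>   server = grpc.server(futures.ThreadPoolExecutor(max_workers=10),
>>                        interceptors=[ArrivalLoggingInterceptor()])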
>>
>> When requests never make it to the server side, they simply hit their 
>> client-specified timeout window without ever being acknowledged, and for 
>> these unsuccessful requests we never see a corresponding Envoy log either. 
>> Sometimes the connection only has a one-off network error like this that 
>> goes away on its own; however, we had a recent incident where one of our 
>> client pods (in a Kubernetes cluster) could not reach the server at all 
>> and thousands of requests failed consecutively over ~24 hours. The pod 
>> quickly ran out of memory and we had to restart it.
>>
>> It is worth noting that some of our calls (e.g. to the `run` method) take 
>> up to 5 minutes to process on the server side. Every other method should 
>> take less than ~1s to complete.
>>
>> Any ideas on where this could be happening, or suggestions for extra 
>> places we could add telemetry, would be helpful!
>>
>> Cheers,
>> Aaron
>>
>>
