Hi,

gRPC's trace flags might help: 
https://github.com/grpc/grpc/blob/master/doc/environment_variables.md

I would recommend "GRPC_VERBOSITY=debug GRPC_TRACE=api" as a starting point.
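
For example, the Python server could be restarted with the variables set (a 
rough sketch; "server.py" is just a placeholder for however you actually 
launch the process):

  GRPC_VERBOSITY=debug GRPC_TRACE=api,call_error python server.py

The Node client reads the same two variables, but grpc-js defines its own set 
of tracer names (documented in grpc-node's TROUBLESHOOTING.md) rather than 
the C-core list linked above.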

The RPC latency could be the result of many factors, e.g. the pod's resource 
limits or the network environment. With the trace flags enabled, we should be 
able to see why a particular RPC failed, i.e. whether it failed at name 
resolution or while connecting to the endpoints.
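
For example, on the Node side a trace narrowed to resolution and connection 
state should show that distinction directly. A sketch, assuming the grpc-js 
tracer names "dns_resolver", "connectivity_state" and "subchannel" (please 
double-check them against grpc-node's TROUBLESHOOTING.md), with "client.js" 
as a placeholder for your client entry point:

  GRPC_VERBOSITY=DEBUG GRPC_TRACE=dns_resolver,connectivity_state,subchannel node client.js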

If nothing looks wrong in the trace log, we could use "netstat" to check for 
packet loss, or "tcpdump" to see where the traffic is actually going.
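
Something along these lines, run from inside the client pod, would be a rough 
starting point (443 here is just a stand-in for whatever port the 
Contour/Envoy listener actually uses):

  netstat -s | grep -i retrans                       # kernel-wide retransmit counters
  tcpdump -i any -n 'tcp port 443' -w /tmp/grpc.pcap  # capture where the traffic goes

If the retransmit counters climb while an RPC is stuck, that points to packet 
loss; if the capture shows no outgoing packets at all for the failed RPCs, the 
problem is more likely inside the client/channel than in the network.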

On Tuesday, August 24, 2021 at 3:28:31 PM UTC-7 Aaron Pelz wrote:

> Hi all, 
>
> New to the group and a little newer to gRPC in general, so hoping to get 
> some pointers on where to dig deeper.
>
> Currently we have two services that interact via gRPC. Their client and 
> server libraries are generated by the protoc compiler. Our server side runs 
> Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both 
> sides run on Kubernetes deployed in the cloud. We chose gRPC for its typed 
> contracts, polyglot server/client code generation, speed, and, most 
> importantly, its HTTP/2 support for long-running, stateful, lightweight 
> connections.
>
> Our setup uses Contour/Envoy load balancing.
>
> The issue right now is that we have requests that
>   1) sometimes never even make it to the server side (i.e. they hit their 
> client-specified timeout with DEADLINE_EXCEEDED)
>   2) take a variable amount of time to get ack'd by the server side (this 
> may not necessarily be an issue in itself, but it may be telling about what 
> is happening)
>     - This range is quite jarring and varies per GRPC method:
>       - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
>       - Method 2 (run)        - p75 8ms,  p95 13ms,  p99 35ms,  max 168ms
>       - Method 3 (abandon)    - p75 7ms,  p95 34ms,  p99 69ms,  max 285ms
>
> When we see requests that never make it to the server side, the requests 
> simply hit their client-specified timeout window without being 
> acknowledged. For these unsuccessful requests, we also never see a 
> corresponding Envoy log. Sometimes the connection only has a one-off 
> network error like this that goes away; however, we recently had an 
> incident where one of our client pods (in a Kubernetes cluster) could not 
> reach the server at all and thousands of requests failed consecutively over 
> ~24 hours. Memory was quickly exhausted and we had to restart the pod.
>
> It is worth noting that some of our processes (e.g. calls to the `run` 
> method) take up to 5 minutes to process on the server side. Every other 
> method should take less than ~1s to complete.
>
> Any ideas on where this could be happening, or suggestions for extra 
> systems we could add telemetry to, would be helpful!
>
> Cheers,
> Aaron
>
>
