[grpc-io] gRPC Kubernetes Timeouts

Aaron Pelz Tue, 24 Aug 2021 15:28:35 -0700

Hi all,

I'm new to the group and fairly new to gRPC in general, so I'm hoping to
get some pointers on where to dig deeper.

Currently we have two services that interact via gRPC. Their client and
server libraries are generated by the protoc compiler. Our server side runs
Python (grpcio 1.37.0) and our client side runs Node.js (grpc-js 1.3.7).
Both sides run on Kubernetes in the cloud. We chose gRPC for its typed
contracts, polyglot server/client code generation, speed, and, most
importantly, its HTTP/2 support for long-running, stateful, lightweight
connections.

Our setup uses Contour/Envoy load balancing.

The issue right now is that we have requests that
1) sometimes never make it to the server side (i.e. they hit their
   client-specified timeout with DEADLINE_EXCEEDED)
2) take a variable amount of time to get ack'd by the server side (this
   may not necessarily be an issue, but it may be telling as to what is
   happening)
   - This range is quite wide and varies per gRPC method:
     - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
     - Method 2 (run) - p75 8ms, p95 13ms, p99 35ms, max 168ms
     - Method 3 (abandon) - p75 7ms, p95 34ms, p99 69ms, max 285ms
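
For anyone wanting to compare numbers, here is a minimal sketch of how
ack-latency percentiles like the ones above could be computed from raw
samples. It assumes the nearest-rank definition of a percentile, which may
differ from what a given metrics backend uses; the function names
`percentile` and `summarize` are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the same four figures quoted per method above."""
    return {
        "p75": percentile(latencies_ms, 75),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }
```

With only a handful of samples the upper percentiles collapse onto the
maximum, so the tail figures are only meaningful with a reasonably large
sample per method.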

When requests never make it to the server side, they simply hit their
client-specified timeout without being acknowledged. For these
unsuccessful requests, we also never see a corresponding Envoy log.
Sometimes a connection only has a one-off network error like this that
goes away. However, we had a recent incident where one of our client pods
(in a Kubernetes cluster) could not reach the server at all, and thousands
of requests failed consecutively over ~24 hours. The pod's available
memory dropped quickly and we had to restart the pod.

It is worth noting that some of our calls (e.g. to the `run` method) take
up to 5 minutes to process on the server side. Every other method should
take less than ~1s to complete.
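
To make the difference in expected durations concrete, here is a rough
sketch of keeping per-method deadlines apart, so that the long-running
`run` call is not judged against the same timeout as the short calls. The
values are assumptions based on the durations described above, not our
actual client settings, and `DEADLINES_S` and `deadline_for` are
hypothetical names.

```python
# Assumed per-method deadlines in seconds, based on the durations above:
# `run` can take up to 5 minutes, everything else should finish in ~1s.
DEADLINES_S = {
    "initialize": 1.0,
    "run": 300.0,
    "abandon": 1.0,
}

# Fallback for any method not listed (assumption: treat it as a short call).
DEFAULT_DEADLINE_S = 1.0


def deadline_for(method: str) -> float:
    """Return the deadline, in seconds, to apply to a call to `method`."""
    return DEADLINES_S.get(method, DEFAULT_DEADLINE_S)


# With grpcio, a per-call deadline is passed via the `timeout` keyword,
# e.g. stub.Run(request, timeout=deadline_for("run")), where `stub.Run`
# is a hypothetical generated method name.
```

Our real client is grpc-js, where the equivalent is a `deadline` call
option; the Python form is used here only to keep the sketch short.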

Any ideas on where this could be happening, or on which other systems we
could add telemetry to, would be helpful!

Cheers,
Aaron

--
To view this discussion on the web visit
https://groups.google.com/d/msgid/grpc-io/e4255f2c-33f3-4850-8fe0-a8dfd0a3cdc9n%40googlegroups.com.
