Hi all, 

I'm new to the group and fairly new to gRPC in general, so I'm hoping to
get some pointers on where to dig deeper.

Currently we have two services that interact via gRPC. Their client and 
server libraries are generated by the protoc compiler. Our server side runs 
Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both 
sides run on Kubernetes in the cloud. We chose gRPC for its typed 
contracts, polyglot server/client code generation, speed, and, most 
importantly, its HTTP/2 support for long-running, stateful, lightweight 
connections.

Our setup uses Contour/Envoy load balancing.

The issue right now is that we have requests that
  1) sometimes never even make it to the server side (i.e. they hit their 
client-specified timeout with DEADLINE_EXCEEDED)
  2) take a variable amount of time to be acknowledged by the server side 
(this may not necessarily be an issue, but perhaps it is telling as to what 
is happening)
    - The spread is quite large and varies per gRPC method:
      - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
      - Method 2 (run)        - p75 8ms,  p95 13ms,  p99 35ms,  max 168ms
      - Method 3 (abandon)    - p75 7ms,  p95 34ms,  p99 69ms,  max 285ms
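
For anyone who wants to compare numbers, here is a minimal sketch of how
percentiles like the ones above can be derived from raw latency samples. It
is only an illustration using Python's standard library; the function name
`latency_summary` and the choice of the "inclusive" interpolation method
are assumptions, not necessarily how our metrics pipeline computes them.

```python
import statistics


def latency_summary(samples_ms):
    """Return p75/p95/p99/max for latency samples given in milliseconds.

    Hypothetical helper for illustration; not part of gRPC or Envoy.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99, so p75 is index 74.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p75": cuts[74],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }
```

Different tools interpolate percentiles differently, so small samples can
give slightly different values from one system to another.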

When we see requests that never make it to the server side, the requests 
simply hit their client-specified timeout window without being 
acknowledged. For these unsuccessful requests, we also never see a 
corresponding Envoy log. Sometimes this is only a one-off network error 
that goes away. However, we had a recent incident where one of our client 
pods (in a Kubernetes cluster) could not reach the server at all and 
thousands of requests failed consecutively over ~24 hours. Memory usage on 
the pod grew quickly and we had to restart it.
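
To illustrate the kind of client-side telemetry that might flag an incident
like this earlier, here is a minimal sketch that counts consecutive
DEADLINE_EXCEEDED results and reports when a streak crosses a threshold. It
is written in Python for brevity even though our client is Node. The class
name `FailureStreak`, the `threshold` value, and the idea of passing the
status as a plain string are all hypothetical and not part of any gRPC API.

```python
from dataclasses import dataclass

DEADLINE_EXCEEDED = "DEADLINE_EXCEEDED"


@dataclass
class FailureStreak:
    """Tracks consecutive deadline failures (hypothetical helper)."""

    threshold: int = 5    # assumed value; tune to the real traffic pattern
    consecutive: int = 0  # current run of DEADLINE_EXCEEDED results
    total: int = 0        # all DEADLINE_EXCEEDED results seen so far

    def record(self, status: str) -> bool:
        """Record one call outcome.

        Returns True once the current streak reaches the threshold, which
        could be used to emit an alert or recreate the client channel.
        """
        if status == DEADLINE_EXCEEDED:
            self.consecutive += 1
            self.total += 1
        else:
            # Any other status means the server was reached; reset the streak.
            self.consecutive = 0
        return self.consecutive >= self.threshold
```

A one-off timeout resets on the next successful call, while a pod that
cannot reach the server at all would trip the threshold within a few
requests rather than after thousands.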

It is worth noting that some of our calls (e.g. calls to the `run` 
method) take up to 5 minutes to process on the server side. Every other 
method should take less than ~1s to complete.

Any ideas on where this could be going wrong, or on which additional 
systems we could add telemetry to, would be helpful!

Cheers,
Aaron
