Thanks for the tip! We didn't find anything particularly insightful in the logs, but we did find an issue with a load balancer timeout; after tweaking it, the failures have become less frequent.
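In case it helps anyone else following the thread: the two variables Lidi mentioned are read by the gRPC C core when it initializes, so the most reliable place to set them is the pod/container environment. As a rough sketch, on the Python (grpcio) side they can also be set in-process, as long as that happens before `grpc` is imported:

    import os

    # GRPC_VERBOSITY / GRPC_TRACE are picked up by the gRPC C core at
    # initialization, so set them before the `grpc` import (or, better,
    # in the container environment).
    os.environ.setdefault("GRPC_VERBOSITY", "debug")
    os.environ.setdefault("GRPC_TRACE", "api")

    import grpc  # import deliberately placed after the env setup

As far as I know, grpc-js honors the same two variables on the Node side, though it has its own set of tracer names; the "api" tracer is specific to the C core.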
On Wednesday, August 25, 2021 at 2:15:29 PM UTC-4 Lidi Zheng wrote:
> Hi,
>
> gRPC's trace flags might help:
> https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
>
> I would recommend "GRPC_VERBOSITY=debug GRPC_TRACE=api" as a starting point.
>
> The RPC latency could be the result of many factors, e.g. the pod's resource limits or the network environment. With the tracing flag, we might be able to see why a particular RPC failed: was it name resolution, or a failure to connect to endpoints?
>
> If nothing wrong is observed in the trace log, we could use "netstat" to check for packet loss, or "tcpdump" to see where the traffic is actually going.
>
> On Tuesday, August 24, 2021 at 3:28:31 PM UTC-7 Aaron Pelz wrote:
>
>> Hi all,
>>
>> New to the group and a little newer to gRPC in general, so hoping to get some pointers on where to dig deeper.
>>
>> Currently we have two services that interact via gRPC. Their client and server libraries are generated by the protoc compiler. Our server side runs Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both sides run on Kubernetes deployed in the cloud. We chose gRPC for its typed contracts, polyglot server/client code generation, speed, and, most importantly, its HTTP/2 support for long-running, stateful, lightweight connections.
>>
>> Our setup uses Contour/Envoy load balancing.
>>
>> The issue right now is that we have requests that:
>> 1) sometimes never even make it to the server side (i.e. they hit their client-specified timeout with DEADLINE_EXCEEDED)
>> 2) take a variable amount of time to get ack'd by the server side (this may not necessarily be an issue, but perhaps it is telling as to what is happening)
>>    - This range is quite jarring and varies per gRPC method:
>>      - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
>>      - Method 2 (run) - p75 8ms, p95 13ms, p99 35ms, max 168ms
>>      - Method 3 (abandon) - p75 7ms, p95 34ms, p99 69ms, max 285ms
>>
>> When we see requests that never make it to the server side, the requests simply hit their client-specified timeout window without being acknowledged. For these unsuccessful requests, we also never see a corresponding Envoy log. Sometimes the connection only has a one-off network error like this that goes away; however, we had a recent incident where one of our client pods (in a Kubernetes cluster) could not reach the server at all, and thousands of requests failed consecutively over ~24 hours. The pod's memory was quickly exhausted and we had to restart it.
>>
>> It is worth noting that some of our processes (e.g. calls to the `run` method) take up to 5 minutes to process on the server side. Every other method should take less than ~1s to complete.
>>
>> Any ideas on where this could be happening, or suggestions for extra systems we could add telemetry to, would be helpful!
>>
>> Cheers,
>> Aaron
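One more thing worth double-checking, given that `run` can legitimately take ~5 minutes on the server: the client-specified deadline has to outlive both the server-side processing time and every proxy timeout in between (Envoy's default route timeout is 15s unless the proxy config overrides it, which is one way a load balancer setting can surface as client-side DEADLINE_EXCEEDED with no server-side evidence). Below is a rough Python sketch of a client-side deadline, with illustrative module/stub/target names (the same idea applies to the `deadline` call option in grpc-js):

    import grpc

    # `service_pb2` / `service_pb2_grpc` stand in for your protoc-generated modules.
    import service_pb2
    import service_pb2_grpc

    channel = grpc.insecure_channel("runner.internal:50051")  # illustrative target
    stub = service_pb2_grpc.RunnerStub(channel)

    try:
        # timeout= becomes the deadline that the server and any proxies see;
        # for a call that can take ~5 minutes, leave headroom beyond that.
        response = stub.Run(service_pb2.RunRequest(), timeout=330)
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # Nothing came back within the deadline. If there is also no Envoy
            # access log for the request, look at the proxy's route/idle
            # timeouts before suspecting the server itself.
            pass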
