[grpc-io] Channel going into TRANSIENT_FAILURE unless keepalive is used

Daniel Rivas Wed, 13 Mar 2024 13:10:57 -0700


Hi all,

I’m experiencing a problem with a channel that I’m unable to explain based
on what I know and what I could find in the documentation. Maybe someone
here can shed some light on the matter :)

I’m using gRPC in my Python project, and seemingly out of the blue some
tests started failing with “Connection refused”. Two colleagues tried to
reproduce the issue in their local environments (both macOS on x86) but
could not. So far, it fails on my machine (Arch Linux) and in the GitHub
CI (Ubuntu 23.10).

I use pytest as the testing framework and create the gRPC channel in a
session-scoped fixture (i.e. the channel is created once at the beginning
of the test session and re-used throughout it).

Environment

Language/Runtime: Python 3.10.3

gRPC version: 1.59.3

OS: Arch Linux. Kernel 6.6.18-1-lts

After a long debugging session, this is what I’ve gathered so far:

- When run in isolation, all tests pass.

- If I change the scope of the fixture to function (every test creates the
channel and closes it at the end), all tests pass.

- When re-using the channel, the first few tests run fine, and after a
certain point all subsequent tests fail with “Connection refused”
(UNAVAILABLE).

The point after which tests start to fail seems to be consistent and is
after the execution of a test module that takes about 11-13 seconds to run
and doesn’t make any gRPC call.

Let’s say I test 3 modules:

- Module A ← uses the gRPC channel.

- Module B ← does not use the gRPC channel.

- Module C ← uses the same gRPC channel again.

I monitored the state changes of the channel, and it looks like at some
point during the execution of Module B the gRPC channel transitions into the
IDLE state and immediately afterwards into TRANSIENT_FAILURE. New calls are
then refused.

I added the following options to the channel:

options = {
    "grpc.keepalive_time_ms": 5000,  # Send keepalive ping every 5 seconds
    "grpc.keepalive_permit_without_calls": True,  # Allow keepalive pings when there are no calls
    "grpc.http2.max_pings_without_data": 0,  # Unlimited pings without data
}

And the problem went away.
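
For anyone who wants to try the same settings, here is a minimal sketch of how such an options dict can be handed to a channel. It assumes an insecure channel, and the names `as_channel_args` and `make_channel` are made up for this illustration. gRPC Python takes channel options as a list of key-value pairs rather than a dict, so the dict is converted first.

```python
from typing import List, Mapping, Tuple, Union

ChannelArg = Tuple[str, Union[int, str]]

# Same keepalive settings as in the options listed above.
KEEPALIVE_OPTIONS = {
    "grpc.keepalive_time_ms": 5000,
    "grpc.keepalive_permit_without_calls": True,
    "grpc.http2.max_pings_without_data": 0,
}


def as_channel_args(options: Mapping[str, Union[bool, int, str]]) -> List[ChannelArg]:
    """Turn an options dict into a list of (key, value) pairs."""
    args: List[ChannelArg] = []
    for key, value in options.items():
        # Spell booleans as 0/1 so every value is a plain int or str.
        if isinstance(value, bool):
            value = int(value)
        args.append((key, value))
    return args


def make_channel(target: str):
    """Create an insecure channel with the keepalive options applied."""
    # Requires the grpcio package; imported here so the helper above
    # can be used and tested without it.
    import grpc

    return grpc.insecure_channel(target, options=as_channel_args(KEEPALIVE_OPTIONS))
```

A session-scoped fixture could then call something like `make_channel("localhost:50051")` (the address is a placeholder) and close the channel at teardown.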

However, two things keep bothering me:

1. From what I’ve gathered, the default IDLE_TIMEOUT is somewhere between
5 min <https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html>
and 30 min <https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99>,
as stated in the documentation.

2. Regardless of the IDLE_TIMEOUT, the gRPC channel should be able to
transition from IDLE to READY and accept new calls.

I added a callback to monitor the change of states and the following
happens:

1. Before the first test in Module A gets executed (when the channel is
created), the channel goes from CONNECTING to READY, as expected.

[2024-03-12 12:44:42.221478+00:00] Channel state changed to CONNECTING.

[2024-03-12 12:44:42.224098+00:00] Channel state changed to READY.

2. Right before the first test in Module C (which uses the channel
again), the channel goes into IDLE and immediately after into
TRANSIENT_FAILURE:

[2024-03-12 12:44:53.473898+00:00] Channel state changed to IDLE.

[2024-03-12 12:44:53.474747+00:00] Channel state changed to TRANSIENT_FAILURE.

When grpc.keepalive_time_ms is set, there is still a transition to the IDLE
state, but the channel doesn't go into TRANSIENT_FAILURE and instead goes
back to READY immediately.
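
For reference, a minimal sketch of a callback that would produce log lines in the format shown above. It is not necessarily the exact code used in the tests: the function names are made up for this illustration, and it assumes the channel is a `grpc.Channel`, whose `subscribe` method invokes the callback with a `grpc.ChannelConnectivity` value on every state change.

```python
from datetime import datetime, timezone


def format_state_change(state, now=None) -> str:
    """Build one log line for a connectivity state such as ChannelConnectivity.READY."""
    if now is None:
        now = datetime.now(timezone.utc)
    # state.name is the enum member name, e.g. "IDLE" or "TRANSIENT_FAILURE".
    return f"[{now}] Channel state changed to {state.name}."


def log_state_change(state) -> None:
    # gRPC calls this once on subscription and again on every state change.
    print(format_state_change(state), flush=True)


def watch_channel(channel) -> None:
    """Subscribe the logger to a grpc.Channel without forcing a connection attempt."""
    channel.subscribe(log_state_change, try_to_connect=False)
```

Passing `try_to_connect=False` keeps the observer passive, so subscribing does not itself pull the channel out of IDLE.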

According to the documentation:

When there has been no RPC activity on a channel for a specified
IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period,
channels that are READY or CONNECTING switch to IDLE. Additionally,
channels that receive a GOAWAY when there are no active or pending RPCs
should also switch to IDLE to avoid connection overload at servers that are
attempting to shed connections. We will use a default IDLE_TIMEOUT of 300
seconds (5 minutes).

Since tests don’t run for longer than the IDLE_TIMEOUT, I suspect it might
have something to do with the GOAWAY, but looking at the traces I wasn’t
able to find anything conclusive.

I’ve kind of fixed the issue by adding the keepalive options but I'd like
to get to the bottom of this because I’m still missing something. Any ideas?

Thanks,

Daniel
