Hi all,

I’m experiencing a problem with a channel that I’m unable to explain based 
on what I know and what I could find in the documentation. Maybe someone 
here can shed some light on the matter :)

I’m using gRPC in my Python project, and seemingly out of the blue some 
tests started failing with “Connection refused”. 
I asked two colleagues to try, but neither could reproduce 
the issue in their local environments (both on macOS x86). So far, it 
fails on my machine (Arch Linux) and in the GitHub CI (Ubuntu 23.10).

I use pytest as the testing framework and create the gRPC channel in a 
session-scoped fixture (i.e. the channel is created once at the beginning 
of the test session and re-used throughout it).

Environment

Language/Runtime: Python 3.10.3

gRPC version: 1.59.3

OS: Arch Linux. Kernel 6.6.18-1-lts


After a long debugging session, this is what I’ve gathered so far:

   - When run in isolation, all tests pass.
   - If I change the scope of the fixture to function (every test creates the 
     channel and closes it at the end), all tests pass.
   - When re-using the channel, the first few tests run fine, and after a 
     certain point all subsequent tests fail with “Connection refused” 
     (UNAVAILABLE).

The point after which tests start to fail is consistent: it comes right 
after a test module that takes about 11-13 seconds to run and doesn’t make 
any gRPC calls.

Let’s say I test 3 modules:

   - Module A ← uses the gRPC channel.
   - Module B ← does not use the gRPC channel.
   - Module C ← uses the same gRPC channel again.

I monitored the state changes of the channel, and it looks like at some 
point during the execution of Module B the gRPC channel transitions into the 
IDLE state and immediately afterwards into TRANSIENT_FAILURE. New calls get 
refused.

I added the following options to the channel:

options = {
    "grpc.keepalive_time_ms": 5000,  # Send a keepalive ping every 5 seconds
    "grpc.keepalive_permit_without_calls": True,  # Allow keepalive pings when there are no calls
    "grpc.http2.max_pings_without_data": 0,  # Unlimited pings without data
}

And the problem went away.
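
For reference, here is a minimal sketch of how options like these might be passed to the channel. grpc.insecure_channel takes its options as a sequence of (key, value) pairs rather than a dict, so the dict is converted first. The helper names and the "localhost:50051" target are hypothetical and not taken from my project.

```python
from typing import List, Mapping, Tuple, Union

ChannelArg = Tuple[str, Union[int, str]]

KEEPALIVE_OPTIONS = {
    "grpc.keepalive_time_ms": 5000,
    "grpc.keepalive_permit_without_calls": True,
    "grpc.http2.max_pings_without_data": 0,
}


def to_channel_args(options: Mapping[str, Union[bool, int, str]]) -> List[ChannelArg]:
    """Turn a dict of channel options into a list of (key, value) pairs.

    Booleans are converted to 0/1 so every value is a plain int or str.
    """
    return [
        (key, int(value) if isinstance(value, bool) else value)
        for key, value in options.items()
    ]


def make_channel(target: str = "localhost:50051"):
    """Create an insecure channel with the keepalive options (hypothetical target)."""
    # Imported here so to_channel_args() can be used without grpcio installed.
    import grpc

    return grpc.insecure_channel(target, options=to_channel_args(KEEPALIVE_OPTIONS))
```

In a pytest setup, make_channel() would be what the session-scoped fixture calls once and closes at the end of the session.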

    

However, two things keep bothering me:

    1. From what I’ve gathered, the default IDLE_TIMEOUT is somewhere 
between 5 min 
<https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html> 
and 30 min 
<https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99>, 
as stated in the documentation, so it should not fire after a gap of only 
11-13 seconds.

    2. Regardless of the IDLE_TIMEOUT, the gRPC channel should be able to 
transition from IDLE to READY and accept new calls.


I added a callback to monitor the change of states and the following 
happens:

    1. Before the first test in Module A gets executed (when the channel is 
created), the channel goes from CONNECTING to READY, as expected.

[2024-03-12 12:44:42.221478+00:00] Channel state changed to CONNECTING.

[2024-03-12 12:44:42.224098+00:00] Channel state changed to READY.

    2. Right before the first test in Module C (which uses the channel 
again), the channel goes into IDLE and immediately afterwards into 
TRANSIENT_FAILURE:

      - [2024-03-12 12:44:53.473898+00:00] Channel state changed to IDLE.

      - [2024-03-12 12:44:53.474747+00:00] Channel state changed to TRANSIENT_FAILURE.

   

When grpc.keepalive_time_ms is set, there is still a transition to the IDLE 
state, but the channel doesn't go into TRANSIENT_FAILURE and instead goes 
back to READY immediately.

According to the documentation:

When there has been no RPC activity on a channel for a specified 
IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, 
channels that are READY or CONNECTING switch to IDLE. Additionally, 
channels that receive a GOAWAY when there are no active or pending RPCs 
should also switch to IDLE to avoid connection overload at servers that are 
attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 
seconds (5 minutes).

Since tests don’t run for longer than the IDLE_TIMEOUT, I suspect it might 
have something to do with the GOAWAY, but looking at the traces I wasn’t 
able to find anything conclusive.
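
In case it helps with reproducing this, below is a sketch of one way to turn on gRPC core tracing from Python through the GRPC_VERBOSITY and GRPC_TRACE environment variables. The tracer selection is only a guess at what is relevant here, and the helper name is hypothetical. The variables have to be set before grpc is imported.

```python
import os
from typing import Sequence


def enable_grpc_tracing(
    tracers: Sequence[str] = ("connectivity_state", "http", "http_keepalive"),
) -> str:
    """Set the environment variables that enable gRPC core tracing.

    Must run before `import grpc`, because the core library reads these
    variables when it is initialised.
    """
    os.environ["GRPC_VERBOSITY"] = "debug"
    os.environ["GRPC_TRACE"] = ",".join(tracers)
    return os.environ["GRPC_TRACE"]
```

With pytest, this could be called at the very top of conftest.py, or the two variables could simply be exported in the shell before running the tests.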

I’ve kind of fixed the issue by adding the keepalive options but I'd like 
to get to the bottom of this because I’m still missing something. Any ideas?

Thanks,

Daniel
