hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3498912732

   > But if the server is back online, why the connection keeps in this 
half-open status? Not timeout, and not established? What happened actually? 
This seems not an expected status.
   
   I checked the code and found that **SkyWalking DOES set a deadline for RPC calls**, but there's still a critical difference:
   
   ## RPC Deadline vs Keepalive Timeout
   
   ### 1. SkyWalking's RPC Configuration
   
   From `TraceSegmentServiceClient.java:92-93`:
   ```java
   StreamObserver<SegmentObject> upstreamSegmentStreamObserver = serviceStub.withDeadlineAfter(
       Config.Collector.GRPC_UPSTREAM_TIMEOUT, TimeUnit.SECONDS  // Default: 30 seconds
   ).collect(...)
   ```
   
   From `Config.java:212`:
   ```java
   public static int GRPC_UPSTREAM_TIMEOUT = 30;  // 30 seconds default
   ```
   
   So RPC calls DO have a 30-second timeout!
   
   ### 2. But Why Is Keepalive Still Needed?
   
   Here's the **ROOT CAUSE**: an RPC deadline timeout doesn't trigger reconnection!
   
   **The Critical Code in `GRPCChannelManager.isNetworkError()` (lines 282-291):**
   ```java
   private boolean isNetworkError(Throwable throwable) {
       if (throwable instanceof StatusRuntimeException) {
           StatusRuntimeException statusRuntimeException = (StatusRuntimeException) throwable;
           return statusEquals(
               statusRuntimeException.getStatus(),
               Status.UNAVAILABLE,        // ✅ Keepalive generates this
               Status.PERMISSION_DENIED,
               Status.UNAUTHENTICATED,
               Status.RESOURCE_EXHAUSTED,
               Status.UNKNOWN
               // ❌ But NOT Status.DEADLINE_EXCEEDED (an RPC timeout generates this)
           );
       }
       return false;
   }
   ```
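
   To make the consequence concrete, here is a minimal, self-contained sketch (a simplified illustration, not the actual `GRPCChannelManager` class) of how a reconnect flag guarded by `isNetworkError()` reacts to the two status codes:
   ```java
   import io.grpc.Status;
   import io.grpc.StatusRuntimeException;

   // Simplified sketch: the reconnect flag only flips when the failure is
   // classified as a network error, so DEADLINE_EXCEEDED never triggers a rebuild.
   public class ReconnectGateDemo {
       private volatile boolean reconnect = false;

       public void reportError(Throwable t) {
           if (isNetworkError(t)) {
               reconnect = true;
           }
       }

       private boolean isNetworkError(Throwable throwable) {
           if (throwable instanceof StatusRuntimeException) {
               Status.Code code = ((StatusRuntimeException) throwable).getStatus().getCode();
               return code == Status.Code.UNAVAILABLE
                   || code == Status.Code.PERMISSION_DENIED
                   || code == Status.Code.UNAUTHENTICATED
                   || code == Status.Code.RESOURCE_EXHAUSTED
                   || code == Status.Code.UNKNOWN;
           }
           return false;
       }

       public static void main(String[] args) {
           ReconnectGateDemo gate = new ReconnectGateDemo();

           gate.reportError(new StatusRuntimeException(Status.DEADLINE_EXCEEDED));
           System.out.println("after DEADLINE_EXCEEDED: reconnect = " + gate.reconnect); // false

           gate.reportError(new StatusRuntimeException(Status.UNAVAILABLE));
           System.out.println("after UNAVAILABLE:       reconnect = " + gate.reconnect); // true
       }
   }
   ```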
   
   **Without `keepAliveWithoutCalls(true)`:**
   ```
   T0:     Last RPC completes successfully
   T0-T120: No RPC calls (idle period)
           → No data sent, no way to detect connection is broken
           → TCP thinks connection is still alive (half-open)
   T120:   New RPC call triggered
           → withDeadlineAfter(30s) starts counting
           → Data goes to TCP send-queue
           → Waits for response...
   T150:   30 second deadline expires
           → RPC call fails with Status.DEADLINE_EXCEEDED
           → isNetworkError() returns FALSE ❌
           → reportError() NOT called
           → reconnect stays FALSE
           → Half-open connection persists
   T180:   Another RPC triggered
           → Same cycle repeats...
           → Infinite loop!
   ```
   
   **With `keepAliveWithoutCalls(true)`:**
   ```
   T0:     Last RPC completes
   T120:   Keepalive PING sent (even during idle)
           → No PONG response
   T150:   Keepalive timeout (30s)
           → Throws: Status.UNAVAILABLE: Keepalive failed ✅
           → isNetworkError() returns TRUE
           → reportError() called → reconnect = true
           → State becomes TRANSIENT_FAILURE
           → transientFailureCount++
   T180:   Next run() cycle
           → reconnect = true, forces channel rebuild
           → Breaks the half-open connection!
   ```
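
   For reference, the keepalive behavior in the timeline above maps to standard gRPC channel-builder settings roughly like this (a sketch with illustrative host/port and intervals, not necessarily the exact values used in this PR):
   ```java
   import java.util.concurrent.TimeUnit;

   import io.grpc.ManagedChannel;
   import io.grpc.netty.NettyChannelBuilder;

   // Sketch only: keepalive-enabled channel construction with illustrative values.
   public final class KeepaliveChannelSketch {
       public static ManagedChannel build(String host, int port) {
           return NettyChannelBuilder.forAddress(host, port)
               .keepAliveTime(120, TimeUnit.SECONDS)    // send an HTTP/2 PING after 120s of inactivity (T120 above)
               .keepAliveTimeout(30, TimeUnit.SECONDS)  // declare the connection dead if no ACK within 30s (T150 above)
               .keepAliveWithoutCalls(true)             // keep pinging even when no RPC is in flight (idle periods)
               .usePlaintext()
               .build();
       }
   }
   ```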
   
   **Evidence from production logs:**
   ```
   2025-10-31 09:34:40.081 ERROR JVMMetricsSender : send JVM metrics to Collector fail.
   org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
     UNAVAILABLE: Keepalive failed. The connection is likely gone
   ```
   
   ### 3. The Real Problem
   
   The issue is **what happens BETWEEN RPC calls**:
   
   | Scenario | Without keepalive | With keepalive |
   |----------|-------------------|----------------|
   | Idle period (no RPC) | No detection mechanism | PING detects failure |
   | Half-open connection | Persists undetected | Detected within 150s |
   | Next RPC call | Goes to broken connection | Blocked by TRANSIENT_FAILURE state |
   | Result | Deadline expires, but reconnect logic fails due to `isConnected(false)` returning true | Connection state is accurate |
   
   ### 4. Why My Production Environment Showed a Growing Send-Queue
   
   In my production environment, the send-queue growth happened because:
   1. RPC deadline is 30 seconds
   2. But during idle periods (no RPC), the half-open connection wasn't detected
   3. When burst of RPC calls came, they all queued up
   4. Each call would timeout after 30s, but new calls kept coming
   5. The cycle of "timeout → reconnect attempt → isConnected(true) → markAsConnected()" kept the broken connection in place
   
   **Keepalive solves this by detecting the broken connection BEFORE the next 
RPC burst arrives.**
   
   So the answer is: **the RPC deadline handles per-call timeouts, while keepalive handles connection health monitoring during idle periods.** Both are needed!
   
   ---
   
   ## Additional Discussion Point
   
   **Should `DEADLINE_EXCEEDED` be treated as a network error?**
   
   Currently, `isNetworkError()` doesn't include `Status.DEADLINE_EXCEEDED`, 
which means RPC timeouts don't trigger reconnection. This is actually correct 
behavior in most cases:
   
   - **DEADLINE_EXCEEDED** = application-level timeout (request took too long)
     - Could be due to slow server processing
     - Not necessarily a connection problem
     - Example: Complex query times out after 30s
   
   - **UNAVAILABLE** = transport-level failure (connection broken)
     - Definitely a network/connection problem
     - Should always trigger reconnection
     - Example: Keepalive PING fails
   
   However, in the specific case of **half-open connections**, 
DEADLINE_EXCEEDED becomes misleading because:
   1. The request didn't actually reach the server (connection is broken)
   2. But it's reported as a timeout (not a connection error)
   3. This prevents proper reconnection logic from kicking in
   
   **My current fix solves this by:**
   - Adding `keepAliveWithoutCalls(true)` to detect half-open connections 
proactively
   - Using TRANSIENT_FAILURE state monitoring as a backup safeguard (see the sketch after this list)
   - This way we don't need to change the semantic meaning of DEADLINE_EXCEEDED
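
   As a rough illustration of that backup safeguard (a sketch built on the public `ManagedChannel` connectivity-state API, not the PR's actual implementation), the idea is to watch state transitions and force a channel rebuild once `TRANSIENT_FAILURE` is observed repeatedly:
   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   import io.grpc.ConnectivityState;
   import io.grpc.ManagedChannel;

   // Sketch only: counts consecutive TRANSIENT_FAILURE observations; a real
   // implementation would set the reconnect flag / rebuild the channel at the threshold.
   public final class TransientFailureWatcher {
       private final ManagedChannel channel;
       private final AtomicInteger transientFailureCount = new AtomicInteger();
       private final int threshold;

       public TransientFailureWatcher(ManagedChannel channel, int threshold) {
           this.channel = channel;
           this.threshold = threshold;
       }

       public void start() {
           watch(channel.getState(false));
       }

       private void watch(ConnectivityState current) {
           // One-shot callback that fires when the state changes away from `current`.
           channel.notifyWhenStateChanged(current, () -> {
               ConnectivityState next = channel.getState(false);
               if (next == ConnectivityState.TRANSIENT_FAILURE) {
                   if (transientFailureCount.incrementAndGet() >= threshold) {
                       // Here the agent would force a channel rebuild (reconnect = true).
                       transientFailureCount.set(0);
                   }
               } else if (next == ConnectivityState.READY) {
                   transientFailureCount.set(0);
               }
               watch(next); // keep watching subsequent transitions
           });
       }
   }
   ```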
   
   **Alternative approach (for consideration):**
   Add `Status.DEADLINE_EXCEEDED` to `isNetworkError()`, but this might cause 
false positives where legitimate slow requests trigger unnecessary 
reconnections.
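
   Concretely, that alternative would look roughly like the following (sketch only, not what this PR proposes), reusing the `statusEquals` helper shown earlier:
   ```java
   private boolean isNetworkError(Throwable throwable) {
       if (throwable instanceof StatusRuntimeException) {
           StatusRuntimeException statusRuntimeException = (StatusRuntimeException) throwable;
           return statusEquals(
               statusRuntimeException.getStatus(),
               Status.UNAVAILABLE,
               Status.PERMISSION_DENIED,
               Status.UNAUTHENTICATED,
               Status.RESOURCE_EXHAUSTED,
               Status.UNKNOWN,
               Status.DEADLINE_EXCEEDED   // new: risks false positives on genuinely slow requests
           );
       }
       return false;
   }
   ```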
   
   What do you think? Should we keep the current approach or also add 
DEADLINE_EXCEEDED to network errors?
   

