hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3510207871

   > A key question before checking codes, as you said, `TRANSIENT_FAILURE` 
represents `Next Call goes to broken connection, and gRPC returns this.` Why 
doesn't treat this as a network error directly? You created another method and 
configurations to verify this status, and also trigger the reconnection like 
other network issue. It seems no different to me. Did I miss anything?
   
   You're absolutely right! After reviewing the code more carefully, I realized 
that monitoring TRANSIENT_FAILURE state is indeed redundant.
   
   Here's why:
   1. When the channel is in TRANSIENT_FAILURE state, any RPC call will 
immediately throw an UNAVAILABLE exception
   2. UNAVAILABLE is already in the `isNetworkError()` list, which triggers 
`reportError()` and reconnection
   3. So the TRANSIENT_FAILURE monitoring provides no additional benefit
   
   I've removed the `transientFailureCount` mechanism and 
`checkChannelStateAndTriggerReconnectIfNeeded()` method in the latest commits.
   
   ## The Real Problem and Solution
   
   After analyzing production logs, I found the actual issue was with the 
**original reconnection logic**:
   
   **Original code problem:**
   ```java
   } else if (managedChannel.isConnected(++reconnectCount > 5)) {
       reconnectCount = 0;  // Reset counter when isConnected() returns true
       reconnect = false;
   }
   ```
   
   When `isConnected()` returns true (even if it's a false positive due to 
half-open connection), the counter gets reset. This means:
   - If the connection is in a half-open state, `isConnected()` may return true
   - `reconnectCount` gets reset to 0 every time
   - The counter never reaches the threshold
   - Channel never gets rebuilt, even though the connection is actually broken
   
   **Current solution:**
   ```java
   if (reconnectCount > Config.Agent.FORCE_RECONNECTION_PERIOD) {
       // Force rebuild channel
       createNewChannel(...);
   } else if (managedChannel.isConnected(false)) {
       // Trust the connection but DON'T reset reconnectCount
       notifyConnected();
   }
   ```
   
   Now `reconnectCount` only resets in `createNewChannel()` after actual 
channel rebuild. This ensures:
   - Even if `isConnected()` returns false positives, the counter keeps 
accumulating
   - After threshold is exceeded, channel is forcibly rebuilt
   - Half-open connections are resolved within predictable timeframe (threshold 
× check_interval)
   
   This matches the original intent of `FORCE_RECONNECTION_PERIOD` 
configuration.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to