guanzhenxing opened a new pull request, #6274:
URL: https://github.com/apache/shenyu/pull/6274

   ## Problem Description
   
     When the Admin side health check detects an upstream as unhealthy and 
publishes a configuration update with `status=false`, the Gateway side 
completely removes that upstream from both `healthyUpstream` and 
`unhealthyUpstream` maps. This causes the Gateway to lose track of the 
upstream, preventing its independent health check from recovering the upstream 
when it becomes healthy again.
   
    ## Error Manifestation
   
     divide upstream configuration error
     CANNOT_FIND_HEALTHY_UPSTREAM_URL
   
    ## Root Cause
   
     ShenYu has **two independent health check systems:**
     1. **Admin side:** Checks upstream health and publishes configuration 
updates
     2. **Gateway side:** Independently runs health checks and maintains its 
own health state
   
     The issue occurs when:
     1. Admin's health check marks an upstream as unhealthy (`status=false`)
     2. Admin publishes a configuration update
     3. Gateway receives the update via `DivideUpstreamDataHandler`
     4. `UpstreamCacheManager.submit()` processes `status=false` upstreams
     5. Bug: Original code calls `triggerRemoveOne()` which removes the 
upstream from BOTH healthy and unhealthy maps
     6. Result: Gateway loses all tracking of this upstream - even if it 
recovers, Gateway won't know
   
     ## Solution
   
     Design Principle
   
     Gateway's health check state should be independent of Admin's 
configuration updates
   
     Core Logic Change
   
     Before:
     status=false → triggerRemoveOne() → completely removed from both maps
   
     After:
     status=false AND healthCheckEnabled=true → preserve in unhealthy map → 
continue health checking
     status=false AND healthCheckEnabled=false → remove (no monitoring needed)
   
     Changes
   
     1. UpstreamCacheManager.java
   
     Refactored `submit()` method
   
     - Extracted logic into smaller, focused methods for better maintainability
     - Fixed `ConcurrentModificationException` by creating ArrayList copy 
before iteration
   
     New method: `processOfflineUpstreams()`
   
     Handles upstreams with `status=false`:
   ```
     // If upstream was previously in unhealthy map AND health check is enabled:
     //   → Keep it in unhealthy map for continued monitoring
     // If upstream was not previously unhealthy OR health check is disabled:
     //   → Remove it (no monitoring needed)
   ```
   
     New method: `processValidUpstreams()`
   
     Handles upstreams with `status=true`:
     - Checks if upstream was previously in `unhealthyUpstream` map
     - If yes, preserves the unhealthy state instead of forcing it to healthy
     - This allows Gateway's health check to recover it naturally
   
     New method: `getCurrentUnhealthyMap()`
   
     Helper method to get current unhealthy upstreams for state preservation
   
     2. UpstreamCheckTask.java
   
     Made `putToMap()` and `removeFromMap() ` public
   
     - These methods were `private` but are needed by `UpstreamCacheManager`
     - Now allows preserving unhealthy state across configuration updates
   
     Testing
   
     Added 9 comprehensive tests to verify the fix:
   
     **UpstreamCacheManagerTest (4 new tests)**
   
     1. `testSubmitWithStatusFalsePreservesUnhealthyState`: Verifies upstreams 
with `status=false` that were previously unhealthy remain in unhealthy map
     2. `testSubmitWithNewOfflineUpstreamAddedToUnhealthy`: Verifies new 
upstreams with `status=false` are added to unhealthy map for monitoring
     3. `testSubmitPreservesUnhealthyForValidUpstream`: Verifies valid 
upstreams (`status=true`) that were previously unhealthy remain in unhealthy map
     4. `testSubmitWithHealthCheckDisabledAndStatusFalse`: Verifies upstreams 
with `healthCheckEnabled=false` are removed, not added to unhealthy map
   
     **UpstreamCheckTaskTest (5 new tests)**
   
     5. `testPutToMap`: Tests adding upstreams to healthy map
     6. `testPutToMapUnhealthy`: Tests adding upstreams to unhealthy map
     7. `testRemoveFromMap`: Tests removing upstreams from healthy map
     8. `testRemoveFromMapUnhealthy`: Tests removing upstreams from unhealthy 
map
     9. `testMoveUpstreamBetweenMaps`: Tests moving upstreams between healthy 
and unhealthy maps
   
     Test Results
   
     Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
     BUILD SUCCESS
   
     Impact
   
     Before Fix
   
     - Gateway loses unhealthy upstream tracking when Admin publishes updates
     - Recovered upstreams cannot be detected by Gateway
     - Results in `CANNOT_FIND_HEALTHY_UPSTREAM_URL` errors
   
     After Fix
   
     - Gateway preserves its independent health check state
     - Unhealthy upstreams continue to be monitored even after Admin updates
     - Gateway can automatically recover upstreams when they become healthy
     - No manual intervention required
   
     Commits
   
     - 8a0f9e958 - Fix: Preserve unhealthy upstream state when receiving config 
updates from admin
     - 78822b42 - Test: Add tests for upstream unhealthy state preservation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to