guanzhenxing opened a new pull request, #6274:
URL: https://github.com/apache/shenyu/pull/6274
## Problem Description
When the Admin side health check detects an upstream as unhealthy and
publishes a configuration update with `status=false`, the Gateway side
completely removes that upstream from both `healthyUpstream` and
`unhealthyUpstream` maps. This causes the Gateway to lose track of the
upstream, preventing its independent health check from recovering the upstream
when it becomes healthy again.
## Error Manifestation
divide upstream configuration error
CANNOT_FIND_HEALTHY_UPSTREAM_URL
## Root Cause
ShenYu has **two independent health check systems:**
1. **Admin side:** Checks upstream health and publishes configuration
updates
2. **Gateway side:** Independently runs health checks and maintains its
own health state
The issue occurs when:
1. Admin's health check marks an upstream as unhealthy (`status=false`)
2. Admin publishes a configuration update
3. Gateway receives the update via `DivideUpstreamDataHandler`
4. `UpstreamCacheManager.submit()` processes `status=false` upstreams
5. Bug: Original code calls `triggerRemoveOne()` which removes the
upstream from BOTH healthy and unhealthy maps
6. Result: Gateway loses all tracking of this upstream - even if it
recovers, Gateway won't know
## Solution
Design Principle
Gateway's health check state should be independent of Admin's
configuration updates
Core Logic Change
Before:
status=false → triggerRemoveOne() → completely removed from both maps
After:
status=false AND healthCheckEnabled=true → preserve in unhealthy map →
continue health checking
status=false AND healthCheckEnabled=false → remove (no monitoring needed)
Changes
1. UpstreamCacheManager.java
Refactored `submit()` method
- Extracted logic into smaller, focused methods for better maintainability
- Fixed `ConcurrentModificationException` by creating ArrayList copy
before iteration
New method: `processOfflineUpstreams()`
Handles upstreams with `status=false`:
```
// If upstream was previously in unhealthy map AND health check is enabled:
// → Keep it in unhealthy map for continued monitoring
// If upstream was not previously unhealthy OR health check is disabled:
// → Remove it (no monitoring needed)
```
New method: `processValidUpstreams()`
Handles upstreams with `status=true`:
- Checks if upstream was previously in `unhealthyUpstream` map
- If yes, preserves the unhealthy state instead of forcing it to healthy
- This allows Gateway's health check to recover it naturally
New method: `getCurrentUnhealthyMap()`
Helper method to get current unhealthy upstreams for state preservation
2. UpstreamCheckTask.java
Made `putToMap()` and `removeFromMap() ` public
- These methods were `private` but are needed by `UpstreamCacheManager`
- Now allows preserving unhealthy state across configuration updates
Testing
Added 9 comprehensive tests to verify the fix:
**UpstreamCacheManagerTest (4 new tests)**
1. `testSubmitWithStatusFalsePreservesUnhealthyState`: Verifies upstreams
with `status=false` that were previously unhealthy remain in unhealthy map
2. `testSubmitWithNewOfflineUpstreamAddedToUnhealthy`: Verifies new
upstreams with `status=false` are added to unhealthy map for monitoring
3. `testSubmitPreservesUnhealthyForValidUpstream`: Verifies valid
upstreams (`status=true`) that were previously unhealthy remain in unhealthy map
4. `testSubmitWithHealthCheckDisabledAndStatusFalse`: Verifies upstreams
with `healthCheckEnabled=false` are removed, not added to unhealthy map
**UpstreamCheckTaskTest (5 new tests)**
5. `testPutToMap`: Tests adding upstreams to healthy map
6. `testPutToMapUnhealthy`: Tests adding upstreams to unhealthy map
7. `testRemoveFromMap`: Tests removing upstreams from healthy map
8. `testRemoveFromMapUnhealthy`: Tests removing upstreams from unhealthy
map
9. `testMoveUpstreamBetweenMaps`: Tests moving upstreams between healthy
and unhealthy maps
Test Results
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
Impact
Before Fix
- Gateway loses unhealthy upstream tracking when Admin publishes updates
- Recovered upstreams cannot be detected by Gateway
- Results in `CANNOT_FIND_HEALTHY_UPSTREAM_URL` errors
After Fix
- Gateway preserves its independent health check state
- Unhealthy upstreams continue to be monitored even after Admin updates
- Gateway can automatically recover upstreams when they become healthy
- No manual intervention required
Commits
- 8a0f9e958 - Fix: Preserve unhealthy upstream state when receiving config
updates from admin
- 78822b42 - Test: Add tests for upstream unhealthy state preservation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]