mochengqian opened a new pull request, #925:
URL: https://github.com/apache/dubbo-go-pixiu/pull/925

   Implements Issue #905 Step 5: introduce runtime healthy endpoint snapshots.
   
   This PR moves request-path endpoint health reads from mutable 
`Endpoint.UnHealthy` filtering to runtime-published endpoint snapshots. Runtime 
clusters now publish immutable snapshot views for all endpoints, healthy 
endpoints, endpoint lookup by ID, healthy lookup by ID, and per-endpoint 
runtime state.
   
   This PR also fixes boundary cases required by the snapshot transition:
   - health checks publish ID/address-scoped health events instead of mutating 
endpoint config in runtime clusters
   - stale health events from replaced endpoint addresses are ignored
   - endpoint refresh/rebuild preserves health and runtime state only for the 
same endpoint ID and address
   - snapshot accessors return defensive endpoint slices
   - LLM fallback cooldown state moves from `Endpoint.Metadata` to runtime 
endpoint state
   - legacy `LoadBalancer` implementations remain source-compatible through an 
adapter path
   
   Issue #905 step progress:
   - [x] Add tests and benchmarks
   - [x] Fix current correctness issues
   - [x] Tighten runtime consistency
   - [x] Switch cluster lookup to O(1)
   - [x] Introduce healthy endpoint snapshots
   - [ ] Optimize simple LB hot path
   - [ ] Optimize consistent-hash LB last
   
   ### Why
   
   Before this PR, `Endpoint.UnHealthy` served two roles:
   
   1. Config seed state when endpoints enter a cluster.
   2. Mutable runtime health state updated by health checks.
   
   That made request paths depend on mutable endpoint config and repeatedly 
filter healthy endpoints through `ClusterConfig.GetEndpoint(true)`.
   
   LLM fallback had a similar boundary problem: fallback cooldown state was 
stored in `Endpoint.Metadata`, which is endpoint config shared through 
snapshots. Concurrent request paths could therefore mutate shared endpoint 
config while runtime snapshots exposed the same endpoint pointers.
   
   This PR separates those concerns:
   - endpoint config remains the source of desired endpoint membership and 
initial seed state
   - runtime snapshots become the source of request-path health state
   - runtime-only mutable endpoint data lives in `EndpointRuntimeState`
   
   ### How
   
   Add `EndpointSnapshot` to runtime clusters and publish snapshots through 
`atomic.Pointer`.
   
   Snapshot views include:
   - all endpoints
   - healthy endpoints
   - endpoint lookup by ID
   - healthy endpoint lookup by ID
   - endpoint address by ID
   - runtime endpoint state by ID/address
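
As a rough illustration of the publication model, the sketch below builds the views listed above in one pass and publishes them through `atomic.Pointer`. The `Endpoint` stand-in, the field names, and the helpers `buildSnapshot`, `publish`, and `healthyByID` are assumptions made for this example, not the PR's actual definitions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Endpoint is a pared-down stand-in for the endpoint config type.
type Endpoint struct {
	ID        string
	Address   string
	UnHealthy bool // seed state only; not read on the request path
}

// EndpointSnapshot holds the published views. Field names are assumptions.
type EndpointSnapshot struct {
	all         []*Endpoint
	healthy     []*Endpoint
	byID        map[string]*Endpoint
	healthyByID map[string]*Endpoint
	addrByID    map[string]string
}

// buildSnapshot derives every view from the endpoint list in one pass.
func buildSnapshot(endpoints []*Endpoint) *EndpointSnapshot {
	s := &EndpointSnapshot{
		all:         append([]*Endpoint(nil), endpoints...),
		byID:        make(map[string]*Endpoint, len(endpoints)),
		healthyByID: make(map[string]*Endpoint, len(endpoints)),
		addrByID:    make(map[string]string, len(endpoints)),
	}
	for _, e := range endpoints {
		s.byID[e.ID] = e
		s.addrByID[e.ID] = e.Address
		if !e.UnHealthy {
			s.healthy = append(s.healthy, e)
			s.healthyByID[e.ID] = e
		}
	}
	return s
}

// runtimeCluster publishes snapshots; readers load them without locking.
type runtimeCluster struct {
	snapshot atomic.Pointer[EndpointSnapshot]
}

func (c *runtimeCluster) publish(s *EndpointSnapshot) { c.snapshot.Store(s) }

// healthyByID is an O(1) lookup against the current snapshot.
func (c *runtimeCluster) healthyByID(id string) (*Endpoint, bool) {
	s := c.snapshot.Load()
	if s == nil {
		return nil, false
	}
	e, ok := s.healthyByID[id]
	return e, ok
}

func main() {
	c := &runtimeCluster{}
	c.publish(buildSnapshot([]*Endpoint{
		{ID: "a", Address: "10.0.0.1:8080"},
		{ID: "b", Address: "10.0.0.2:8080", UnHealthy: true},
	}))
	_, okA := c.healthyByID("a")
	_, okB := c.healthyByID("b")
	fmt.Println(okA, okB) // true false
}
```

Readers only ever see a fully built snapshot, because the pointer is swapped after every view has been populated.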
   
   Health checks now use callback events carrying endpoint ID, endpoint 
address, and health state. Runtime clusters apply those events by replacing 
snapshot health views instead of mutating `Endpoint.UnHealthy`.
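
The event handling described here could look roughly like the following. `HealthEvent`, `healthView`, and `apply` are hypothetical names for this sketch; it only demonstrates the ID/address guard and the copy-on-write replacement of the health view.

```go
package main

import "fmt"

// HealthEvent is a hypothetical event shape carrying the three fields named above.
type HealthEvent struct {
	EndpointID string
	Address    string
	Healthy    bool
}

// healthView is a simplified immutable snapshot keyed by endpoint ID.
type healthView struct {
	addrByID map[string]string
	healthy  map[string]bool
}

// apply returns a new view with the event applied. It returns the receiver
// unchanged when the ID is unknown or the address no longer matches.
func (v *healthView) apply(ev HealthEvent) (*healthView, bool) {
	addr, ok := v.addrByID[ev.EndpointID]
	if !ok || addr != ev.Address {
		return v, false // stale or unknown event: ignore it
	}
	next := &healthView{
		addrByID: v.addrByID, // never mutated after publication, safe to share
		healthy:  make(map[string]bool, len(v.healthy)),
	}
	for id, h := range v.healthy {
		next.healthy[id] = h
	}
	next.healthy[ev.EndpointID] = ev.Healthy
	return next, true
}

func main() {
	v := &healthView{
		addrByID: map[string]string{"a": "10.0.0.9:8080"},
		healthy:  map[string]bool{"a": true},
	}
	// Event from a checker that is still probing the endpoint's previous address.
	_, applied := v.apply(HealthEvent{EndpointID: "a", Address: "10.0.0.1:8080", Healthy: false})
	fmt.Println(applied) // false

	next, applied := v.apply(HealthEvent{EndpointID: "a", Address: "10.0.0.9:8080", Healthy: false})
	fmt.Println(applied, next.healthy["a"], v.healthy["a"]) // true false true
}
```

The old view is left untouched, so a request that already loaded it keeps a consistent picture for the rest of its pick.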
   
   Direct `CreateHealthCheck()` callers keep legacy behavior: when no callback 
is installed, health checks still update `Endpoint.UnHealthy`.
   
   Endpoint refresh is CAS-based. If another health update wins the race, 
refresh rebuilds from the latest snapshot instead of overwriting newer health 
state.
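
A minimal sketch of the CAS retry pattern, assuming a much simpler snapshot than the real one. The `cluster`, `snapshot`, `setHealth`, and `refresh` names are illustrative; only `atomic.Pointer` and its `CompareAndSwap` method come from the standard library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type snapshot struct {
	ids     []string        // endpoint IDs in configured order
	healthy map[string]bool // health keyed by endpoint ID
}

type cluster struct {
	snap atomic.Pointer[snapshot]
}

// setHealth publishes a copy of the current snapshot with one health bit changed.
func (c *cluster) setHealth(id string, healthy bool) {
	for {
		old := c.snap.Load()
		if old == nil {
			return
		}
		if _, ok := old.healthy[id]; !ok {
			return
		}
		next := &snapshot{ids: old.ids, healthy: make(map[string]bool, len(old.healthy))}
		for k, v := range old.healthy {
			next.healthy[k] = v
		}
		next.healthy[id] = healthy
		if c.snap.CompareAndSwap(old, next) {
			return
		}
	}
}

// refresh installs a new membership list. Each attempt rebuilds from the
// snapshot it just loaded, so a health update that lands mid-refresh is
// picked up on the retry rather than overwritten.
func (c *cluster) refresh(ids []string) {
	for {
		old := c.snap.Load()
		next := &snapshot{
			ids:     append([]string(nil), ids...),
			healthy: make(map[string]bool, len(ids)),
		}
		for _, id := range ids {
			h := true // assumption: endpoints new to the cluster start healthy
			if old != nil {
				if prev, ok := old.healthy[id]; ok {
					h = prev
				}
			}
			next.healthy[id] = h
		}
		if c.snap.CompareAndSwap(old, next) {
			return
		}
	}
}

func main() {
	c := &cluster{}
	c.refresh([]string{"a", "b"})
	c.setHealth("a", false)
	c.refresh([]string{"a", "b", "c"})
	s := c.snap.Load()
	fmt.Println(s.healthy["a"], s.healthy["b"], s.healthy["c"]) // false true true
}
```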
   
   Runtime replacement freezes old health events with 
`SnapshotForRuntimeReplacement()` and constructs the replacement cluster with 
the inherited snapshot before new health checks start.
   
   Runtime state inheritance is guarded by endpoint ID and address:
   - same ID and same address: preserve runtime health/state
   - same ID but new address: drop old runtime state
   - deleted endpoint: drop old runtime state
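
The three inheritance rules can be expressed as one lookup keyed by ID with an address comparison. In this sketch `stateEntry` and `inheritState` are hypothetical, and fresh state is assumed to start healthy (the PR seeds it from endpoint config instead).

```go
package main

import "fmt"

type Endpoint struct {
	ID      string
	Address string
}

// stateEntry is hypothetical runtime state, stored with the address it was recorded for.
type stateEntry struct {
	Address string
	Healthy bool
}

// inheritState carries state forward only when both ID and address match.
// Endpoints missing from next are dropped simply by not being copied.
func inheritState(old map[string]stateEntry, next []Endpoint) map[string]stateEntry {
	out := make(map[string]stateEntry, len(next))
	for _, e := range next {
		if prev, ok := old[e.ID]; ok && prev.Address == e.Address {
			out[e.ID] = prev // same ID and same address: preserve
			continue
		}
		// New ID, or same ID at a new address: start from fresh state.
		out[e.ID] = stateEntry{Address: e.Address, Healthy: true}
	}
	return out
}

func main() {
	old := map[string]stateEntry{
		"a": {Address: "10.0.0.1:80", Healthy: false},
		"b": {Address: "10.0.0.2:80", Healthy: false},
		"c": {Address: "10.0.0.3:80", Healthy: false},
	}
	next := []Endpoint{
		{ID: "a", Address: "10.0.0.1:80"}, // unchanged
		{ID: "b", Address: "10.0.0.9:80"}, // moved
		// "c" was deleted
	}
	got := inheritState(old, next)
	_, hasC := got["c"]
	fmt.Println(got["a"].Healthy, got["b"].Healthy, hasC) // false true false
}
```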
   
   `PickEndpoint()` now reads the runtime healthy snapshot. The single-endpoint 
fast path only applies when the configured runtime snapshot has exactly one 
endpoint; degraded multi-endpoint clusters still invoke the configured load 
balancer.
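
The distinction between a single-endpoint cluster and a degraded multi-endpoint cluster can be sketched as follows. `pickEndpoint` and `picker` are hypothetical, and returning `nil` when nothing is healthy is an assumption of this example rather than documented behavior.

```go
package main

import "fmt"

type Endpoint struct{ ID string }

// picker stands in for the configured load balancer operating on a healthy view.
type picker func(healthy []*Endpoint) *Endpoint

// pickEndpoint takes the fast path only when the cluster is configured with
// exactly one endpoint. A multi-endpoint cluster that has degraded to one
// healthy endpoint still goes through the balancer.
func pickEndpoint(all, healthy []*Endpoint, lb picker) *Endpoint {
	if len(healthy) == 0 {
		return nil
	}
	if len(all) == 1 {
		return healthy[0] // fast path: no balancer call
	}
	return lb(healthy)
}

func main() {
	calls := 0
	lb := func(h []*Endpoint) *Endpoint { calls++; return h[0] }
	a, b := &Endpoint{ID: "a"}, &Endpoint{ID: "b"}

	fmt.Println(pickEndpoint([]*Endpoint{a}, []*Endpoint{a}, lb).ID, calls)    // a 0
	fmt.Println(pickEndpoint([]*Endpoint{a, b}, []*Endpoint{b}, lb).ID, calls) // b 1
}
```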
   
   `PickNextEndpoint()` preserves configured endpoint order but skips endpoints 
that are unhealthy in the current snapshot.
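
One way to read this behavior is a forward scan over the configured order that consults the snapshot's health view. The `pickNext` helper below is hypothetical; whether the real implementation wraps around to the start of the list is not stated, and this sketch does not.

```go
package main

import "fmt"

// pickNext returns the first healthy endpoint ID that follows currentID in the
// configured order. If currentID is not found, the scan starts from the beginning.
func pickNext(order []string, healthy map[string]bool, currentID string) (string, bool) {
	start := 0
	for i, id := range order {
		if id == currentID {
			start = i + 1
			break
		}
	}
	for _, id := range order[start:] {
		if healthy[id] {
			return id, true
		}
	}
	return "", false
}

func main() {
	order := []string{"primary", "fallback-1", "fallback-2"}
	healthy := map[string]bool{"primary": true, "fallback-1": false, "fallback-2": true}

	id, ok := pickNext(order, healthy, "primary")
	fmt.Println(id, ok) // fallback-2 true
}
```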
   
   Built-in load balancers implement an optional `SnapshotLoadBalancer` path. 
The exported `LoadBalancer` interface remains unchanged for out-of-tree 
balancers.
   
   Legacy load balancers are adapted by passing a shallow copied config with 
the current healthy endpoint view. The adapter serializes legacy picks and 
reconciles `PrePickEndpointIndex` back to the real cluster config for 
cursor-style compatibility.
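
The optional-interface dispatch and the legacy adapter might be sketched like this. The method sets are simplified stand-ins rather than the project's real signatures, `PickHealthy` is a hypothetical method name, and the `PrePickEndpointIndex` reconciliation step is left out.

```go
package main

import (
	"fmt"
	"sync"
)

type Endpoint struct{ ID string }

type ClusterConfig struct {
	Endpoints []*Endpoint
}

// LoadBalancer stands in for the legacy interface: it only sees a cluster config.
type LoadBalancer interface {
	Pick(c *ClusterConfig) *Endpoint
}

// SnapshotLoadBalancer is the optional extension; the method name is hypothetical.
type SnapshotLoadBalancer interface {
	PickHealthy(healthy []*Endpoint) *Endpoint
}

var legacyMu sync.Mutex // serializes picks on the legacy path only

// pick prefers the snapshot path and falls back to a shallow-copied config.
func pick(lb LoadBalancer, cfg *ClusterConfig, healthy []*Endpoint) *Endpoint {
	if len(healthy) == 0 {
		return nil
	}
	if slb, ok := lb.(SnapshotLoadBalancer); ok {
		return slb.PickHealthy(healthy)
	}
	legacyMu.Lock()
	defer legacyMu.Unlock()
	view := *cfg // shallow copy: the real config is left untouched
	view.Endpoints = healthy
	return lb.Pick(&view)
}

// legacyFirst implements only the legacy interface.
type legacyFirst struct{}

func (legacyFirst) Pick(c *ClusterConfig) *Endpoint { return c.Endpoints[0] }

// snapshotLast implements both interfaces, so pick uses its snapshot path.
type snapshotLast struct{ legacyFirst }

func (snapshotLast) PickHealthy(h []*Endpoint) *Endpoint { return h[len(h)-1] }

func main() {
	a, b, c := &Endpoint{ID: "a"}, &Endpoint{ID: "b"}, &Endpoint{ID: "c"}
	cfg := &ClusterConfig{Endpoints: []*Endpoint{a, b, c}}
	healthy := []*Endpoint{b, c} // "a" is unhealthy in the current snapshot

	fmt.Println(pick(legacyFirst{}, cfg, healthy).ID)  // b
	fmt.Println(pick(snapshotLast{}, cfg, healthy).ID) // c
	fmt.Println(len(cfg.Endpoints))                    // 3
}
```

Because the extension is detected with a type assertion, an out-of-tree balancer compiles unchanged and simply takes the adapter path.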
   
   LLM fallback cooldown state now uses address-guarded `EndpointRuntimeState`. 
Cooldown pair writes use `StoreMany`, reads use `LoadMany`, and expired cleanup 
uses `DeleteIfMatches` so cleanup does not remove a refreshed cooldown pair.
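
To show why a compare-before-delete cleanup matters, here is a small sketch. Only the method names `StoreMany`, `LoadMany`, and `DeleteIfMatches` come from the description above; their signatures, the `runtimeState` type, and the `cooldown_until`/`cooldown_target` keys are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runtimeState is a hypothetical per-endpoint key/value store.
type runtimeState struct {
	mu     sync.Mutex
	values map[string]string
}

func newRuntimeState() *runtimeState {
	return &runtimeState{values: map[string]string{}}
}

// StoreMany writes all pairs under one lock so readers never see half an update.
func (s *runtimeState) StoreMany(kv map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range kv {
		s.values[k] = v
	}
}

// LoadMany returns the values for keys, or ok=false if any key is missing.
func (s *runtimeState) LoadMany(keys ...string) (map[string]string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(keys))
	for _, k := range keys {
		v, ok := s.values[k]
		if !ok {
			return nil, false
		}
		out[k] = v
	}
	return out, true
}

// DeleteIfMatches removes the keys only if every stored value still equals
// the expected one, so a concurrently refreshed entry survives cleanup.
func (s *runtimeState) DeleteIfMatches(expected map[string]string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, want := range expected {
		if got, ok := s.values[k]; !ok || got != want {
			return false
		}
	}
	for k := range expected {
		delete(s.values, k)
	}
	return true
}

func main() {
	st := newRuntimeState()
	st.StoreMany(map[string]string{"cooldown_until": "100", "cooldown_target": "b"})

	// A cleanup pass observes the expired pair...
	stale, _ := st.LoadMany("cooldown_until", "cooldown_target")
	// ...but another request refreshes the cooldown before cleanup runs.
	st.StoreMany(map[string]string{"cooldown_until": "200", "cooldown_target": "b"})

	fmt.Println(st.DeleteIfMatches(stale)) // false
	cur, ok := st.LoadMany("cooldown_until", "cooldown_target")
	fmt.Println(ok, cur["cooldown_until"]) // true 200
}
```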
   
   Endpoint updates replace the endpoint object in config/runtime snapshots 
instead of mutating the old endpoint pointer in place.
   
   ### Success Criteria
   
   - `PickEndpoint()` uses runtime healthy endpoint snapshots.
   - `GetEndpointByID()` uses O(1) healthy endpoint lookup.
   - Built-in load balancers can consume pre-filtered healthy endpoint views.
   - Health-check transitions update runtime snapshots instead of mutating 
endpoint config in runtime clusters.
   - Stale health events from old endpoint addresses are ignored.
   - Snapshot refresh cannot overwrite newer concurrent health updates.
   - Rebuilt runtime clusters inherit old health/runtime state before 
replacement health checks start.
   - Runtime endpoint state is preserved only for the same endpoint ID and 
address.
   - LLM fallback cooldown no longer writes to `Endpoint.Metadata`.
   - LLM fallback does not route to snapshot-unhealthy fallback endpoints.
   - Snapshot endpoint slices cannot be mutated by callers to corrupt published 
snapshots.
   - Legacy `LoadBalancer` implementations remain source-compatible.
   - Direct `CreateHealthCheck()` callers keep legacy `Endpoint.UnHealthy` 
mutation behavior.
   
   ### Benchmark Evidence
   
   Environment:
   - `goos=darwin`
   - `goarch=arm64`
   - `cpu=Apple M5`
   
   The full Issue #905 baseline was established earlier to guide the whole 
optimization sequence. Because Step 4 has already been merged upstream, this 
PR uses a fresh Step 4-based baseline so that the differences below can be 
attributed to Step 5.
   
   Baseline source:
   - `upstream/develop`
   - commit `4ceeb323`
   - Step 4 merged, Step 5 absent
   
   Only healthy filtering and simple load-balancer hot-path benchmarks are used 
here. Lookup improvements belong to Step 4, and consistent-hash improvements 
belong to the final consistent-hash step.
   
   Benchmark command:
   
   ```bash
   go test ./pkg/server -run '^$' -bench 'BenchmarkCluster(LoadBalancerHotPathSerial|Healthy)' -benchmem -count=5
   ```
   
   Each `Mean` value below is the arithmetic mean across 5 runs. `Range` shows 
the min-max span across the same 5 runs.
   
   #### Simple LB Hot Path
   
   | Benchmark | Before mean ns/op | Before range | After mean ns/op | After range | B/op before -> after | allocs/op before -> after |
   |---|---:|---:|---:|---:|---:|---:|
   | Rand / `endpoints=4` | 13.44 | 13.25-13.79 | 25.39 | 25.11-25.64 | 0 -> 32 | 0 -> 1 |
   | Rand / `endpoints=64` | 133.30 | 131.40-134.70 | 92.90 | 92.30-93.32 | 512 -> 512 | 1 -> 1 |
   | Rand / `endpoints=512` | 903.06 | 880.90-916.20 | 600.92 | 596.60-603.20 | 4864 -> 4864 | 1 -> 1 |
   | RoundRobin / `endpoints=4` | 10.07 | 9.75-10.39 | 22.15 | 21.66-22.57 | 0 -> 32 | 0 -> 1 |
   | RoundRobin / `endpoints=64` | 122.64 | 122.20-123.20 | 90.95 | 90.46-91.71 | 512 -> 512 | 1 -> 1 |
   | RoundRobin / `endpoints=512` | 838.56 | 835.30-844.40 | 618.54 | 602.10-635.00 | 4864 -> 4864 | 1 -> 1 |
   
   #### Healthy View
   
   The Before columns use the old `ClusterConfig.GetEndpoint(true)` 
healthy-filter benchmark, `BenchmarkClusterHealthyFilterCost`.
   
   The After columns use the runtime snapshot healthy-view benchmark, 
`BenchmarkClusterHealthySnapshotLoad`.
   
   | Benchmark | Before mean ns/op | Before range | After mean ns/op | After range | B/op before -> after | allocs/op before -> after |
   |---|---:|---:|---:|---:|---:|---:|
   | `endpoints=8, healthy=100%` | 21.94 | 21.74-22.19 | 17.69 | 17.58-17.85 | 64 -> 64 | 1 -> 1 |
   | `endpoints=8, healthy=50%` | 19.94 | 19.86-20.05 | 13.92 | 13.74-14.05 | 64 -> 32 | 1 -> 1 |
   | `endpoints=8, healthy=0%` | 16.96 | 16.90-17.07 | 2.91 | 2.87-2.96 | 64 -> 0 | 1 -> 0 |
   | `endpoints=64, healthy=100%` | 121.04 | 115.60-124.20 | 86.16 | 84.52-87.05 | 512 -> 512 | 1 -> 1 |
   | `endpoints=64, healthy=50%` | 104.74 | 103.60-105.40 | 45.59 | 45.50-45.67 | 512 -> 256 | 1 -> 1 |
   | `endpoints=64, healthy=0%` | 84.77 | 84.16-85.03 | 2.97 | 2.94-3.04 | 512 -> 0 | 1 -> 0 |
   | `endpoints=512, healthy=100%` | 907.90 | 875.50-939.10 | 603.16 | 592.50-609.90 | 4864 -> 4864 | 1 -> 1 |
   | `endpoints=512, healthy=50%` | 726.64 | 693.90-757.50 | 312.72 | 310.90-313.90 | 4864 -> 2304 | 1 -> 1 |
   | `endpoints=512, healthy=0%` | 484.54 | 483.20-486.40 | 2.98 | 2.93-3.02 | 4864 -> 0 | 1 -> 0 |
   
   This PR removes the old request-path dependency on reading mutable endpoint 
health from `Endpoint.UnHealthy` and repeatedly filtering through 
`ClusterConfig.GetEndpoint(true)`.
   
   Current snapshot accessors intentionally return defensive shallow copies to 
protect published snapshot immutability. Because of that, this PR does not 
claim zero-allocation simple load-balancer hot-path results. The small 
`endpoints=4` simple LB path is slower in this revision due to the defensive 
copy; optimizing that path remains Step 6.
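
A defensive shallow copy of this kind can be sketched as below. The `snapshot` type and `Healthy` accessor are hypothetical; the point is that each call allocates one slice of pointers (four 8-byte pointers is 32 bytes on a 64-bit platform, consistent with the `0 -> 32` B/op rows) while an empty view returns `nil` without allocating.

```go
package main

import "fmt"

type Endpoint struct{ ID string }

// snapshot is a simplified published view; it is never modified after publication.
type snapshot struct {
	healthy []*Endpoint
}

// Healthy returns a copy of the healthy slice so callers cannot reorder or
// overwrite entries in the published snapshot. It is safe on a nil receiver.
func (s *snapshot) Healthy() []*Endpoint {
	if s == nil || len(s.healthy) == 0 {
		return nil // nothing to copy, nothing allocated
	}
	out := make([]*Endpoint, len(s.healthy))
	copy(out, s.healthy)
	return out
}

func main() {
	s := &snapshot{healthy: []*Endpoint{{ID: "a"}, {ID: "b"}}}

	view := s.Healthy()
	view[0] = &Endpoint{ID: "rogue"} // only the caller's copy changes
	fmt.Println(s.healthy[0].ID)     // a

	var none *snapshot
	fmt.Println(len(none.Healthy())) // 0
}
```

The copy is shallow: the `*Endpoint` values themselves are still shared with the snapshot, so it protects the slice, not the endpoint fields.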
   
   ### Tests
   
   Targeted coverage includes:
   - snapshot seeding, defensive slices, and nil-safe accessors
   - runtime health updates without mutating `Endpoint.UnHealthy`
   - stale health events from replaced endpoint addresses
   - refresh versus concurrent health update ordering
   - runtime replacement snapshot inheritance and health-event freezing
   - runtime state inheritance/drop behavior across same-address, 
moved-address, and deleted endpoints
   - LLM cooldown state stored outside `Endpoint.Metadata`
   - pair-oriented cooldown read/write/cleanup behavior
   - `PickEndpoint()` and `GetEndpointByID()` reading healthy snapshots
   - degraded multi-endpoint load balancing with only one healthy endpoint
   - `PickNextEndpoint()` skipping snapshot-unhealthy fallback endpoints
   - legacy load-balancer healthy endpoint adaptation, cursor reconciliation, 
and serialized compatibility path
   - direct `CreateHealthCheck()` compatibility without a callback
   - race coverage for concurrent endpoint picks and health updates
   
   Validated with:
   
   ```bash
   go test -count=1 ./pkg/cluster ./pkg/cluster/loadbalancer ./pkg/filter/llm/proxy ./pkg/server
   go test -race -count=1 ./pkg/cluster ./pkg/cluster/loadbalancer ./pkg/filter/llm/proxy ./pkg/server
   git diff --check
   ```
   
   ### Notes
   
   `Endpoint.UnHealthy` remains for config compatibility and as the seed state 
when snapshots are built.
   
   `EndpointRuntimeState` is runtime-only and is not serialized into endpoint 
config.
   
   Snapshot endpoint slices are immutable after publication, but endpoint 
pointers remain shared for source compatibility. Request-path mutable state 
must stay in runtime state, not endpoint fields or metadata.
   
   Out-of-tree load balancers remain source-compatible through the existing 
`LoadBalancer` interface. New balancers that want the current healthy endpoint 
view should implement `SnapshotLoadBalancer`.
   
   The legacy load-balancer mutex only applies to old `LoadBalancer` 
implementations. Snapshot-aware built-in balancers use the snapshot path 
directly.
   

