mochengqian opened a new pull request, #925: URL: https://github.com/apache/dubbo-go-pixiu/pull/925
Implements Issue #905 Step 5: introduce runtime healthy endpoint snapshots.

This PR moves request-path endpoint health reads from mutable `Endpoint.UnHealthy` filtering to runtime-published endpoint snapshots. Runtime clusters now publish immutable snapshot views for all endpoints, healthy endpoints, endpoint lookup by ID, healthy lookup by ID, and per-endpoint runtime state.

This PR also fixes boundary cases required by the snapshot transition:

- health checks publish ID/address-scoped health events instead of mutating endpoint config in runtime clusters
- stale health events from replaced endpoint addresses are ignored
- endpoint refresh/rebuild preserves health and runtime state only for the same endpoint ID and address
- snapshot accessors return defensive endpoint slices
- LLM fallback cooldown state moves from `Endpoint.Metadata` to runtime endpoint state
- legacy `LoadBalancer` implementations remain source-compatible through an adapter path

Issue #905 step progress:

- [x] Add tests and benchmarks
- [x] Fix current correctness issues
- [x] Tighten runtime consistency
- [x] Switch cluster lookup to O(1)
- [x] Introduce healthy endpoint snapshots
- [ ] Optimize simple LB hot path
- [ ] Optimize consistent-hash LB last

### Why

Before this PR, `Endpoint.UnHealthy` served two roles:

1. Config seed state when endpoints enter a cluster.
2. Mutable runtime health state updated by health checks.

That made request paths depend on mutable endpoint config and repeatedly filter healthy endpoints through `ClusterConfig.GetEndpoint(true)`.

LLM fallback had a similar boundary problem: fallback cooldown state was stored in `Endpoint.Metadata`, which is endpoint config shared through snapshots. Concurrent request paths could therefore mutate shared endpoint config while runtime snapshots exposed the same endpoint pointers.

This PR separates those concerns:

- endpoint config remains the source of desired endpoint membership and initial seed state
- runtime snapshots become the source of request-path health state
- runtime-only mutable endpoint data lives in `EndpointRuntimeState`

### How

Add `EndpointSnapshot` to runtime clusters and publish snapshots through `atomic.Pointer`.

Snapshot views include:

- all endpoints
- healthy endpoints
- endpoint lookup by ID
- healthy endpoint lookup by ID
- endpoint address by ID
- runtime endpoint state by ID/address

Health checks now use callback events carrying endpoint ID, endpoint address, and health state. Runtime clusters apply those events by replacing snapshot health views instead of mutating `Endpoint.UnHealthy`. Direct `CreateHealthCheck()` callers keep legacy behavior: when no callback is installed, health checks still update `Endpoint.UnHealthy`.

Endpoint refresh is CAS-based. If another health update wins the race, refresh rebuilds from the latest snapshot instead of overwriting newer health state.

Runtime replacement freezes old health events with `SnapshotForRuntimeReplacement()` and constructs the replacement cluster with the inherited snapshot before new health checks start.

Runtime state inheritance is guarded by endpoint ID and address:

- same ID and same address: preserve runtime health/state
- same ID but new address: drop old runtime state
- deleted endpoint: drop old runtime state

`PickEndpoint()` now reads the runtime healthy snapshot. The single-endpoint fast path only applies when the configured runtime snapshot has exactly one endpoint; degraded multi-endpoint clusters still invoke the configured load balancer.

`PickNextEndpoint()` preserves configured endpoint order but skips endpoints that are unhealthy in the current snapshot.

Built-in load balancers implement an optional `SnapshotLoadBalancer` path. The exported `LoadBalancer` interface remains unchanged for out-of-tree balancers.

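The snapshot publication and stale-event guard described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the `snapshot`, `HealthEvent`, and `runtimeCluster` types and all their fields are hypothetical stand-ins, and only `atomic.Pointer` with its `Load`, `Store`, and `CompareAndSwap` methods comes from the Go standard library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Endpoint is a simplified stand-in for the project's endpoint config type.
type Endpoint struct {
	ID      string
	Address string
}

// snapshot is a hypothetical immutable view; it is never modified after publication.
type snapshot struct {
	all       []*Endpoint
	healthy   []*Endpoint
	addrByID  map[string]string
	unhealthy map[string]bool
}

// HealthEvent is a hypothetical ID/address-scoped health-check event.
type HealthEvent struct {
	ID      string
	Address string
	Healthy bool
}

type runtimeCluster struct {
	snap atomic.Pointer[snapshot]
}

// build derives every view from the endpoint list and the unhealthy set.
func build(eps []*Endpoint, unhealthy map[string]bool) *snapshot {
	s := &snapshot{
		all:       append([]*Endpoint(nil), eps...),
		addrByID:  make(map[string]string, len(eps)),
		unhealthy: unhealthy,
	}
	for _, ep := range eps {
		s.addrByID[ep.ID] = ep.Address
		if !unhealthy[ep.ID] {
			s.healthy = append(s.healthy, ep)
		}
	}
	return s
}

func newRuntimeCluster(eps []*Endpoint) *runtimeCluster {
	c := &runtimeCluster{}
	c.snap.Store(build(eps, map[string]bool{}))
	return c
}

// applyHealthEvent publishes a new snapshot and reports whether the event was
// applied. Events for an unknown ID or a replaced address are ignored as stale.
func (c *runtimeCluster) applyHealthEvent(ev HealthEvent) bool {
	for {
		old := c.snap.Load()
		if addr, ok := old.addrByID[ev.ID]; !ok || addr != ev.Address {
			return false
		}
		next := make(map[string]bool, len(old.unhealthy)+1)
		for id, u := range old.unhealthy {
			next[id] = u
		}
		next[ev.ID] = !ev.Healthy
		// If another writer published first, retry from the latest snapshot.
		if c.snap.CompareAndSwap(old, build(old.all, next)) {
			return true
		}
	}
}

// Healthy returns a copy so callers cannot alter the published slice.
func (c *runtimeCluster) Healthy() []*Endpoint {
	return append([]*Endpoint(nil), c.snap.Load().healthy...)
}

func main() {
	c := newRuntimeCluster([]*Endpoint{
		{ID: "a", Address: "10.0.0.1:80"},
		{ID: "b", Address: "10.0.0.2:80"},
	})
	applied := c.applyHealthEvent(HealthEvent{ID: "b", Address: "10.0.0.2:80", Healthy: false})
	stale := c.applyHealthEvent(HealthEvent{ID: "a", Address: "10.0.0.9:80", Healthy: false})
	fmt.Println(applied, stale, len(c.Healthy())) // prints: true false 1
}
```

Readers only ever call `Load()`, so a pick never observes a half-updated view; writers pay for rebuilding the derived views on each health transition.
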
Legacy load balancers are adapted by passing a shallow copied config with the current healthy endpoint view. The adapter serializes legacy picks and reconciles `PrePickEndpointIndex` back to the real cluster config for cursor-style compatibility.

LLM fallback cooldown state now uses address-guarded `EndpointRuntimeState`. Cooldown pair writes use `StoreMany`, reads use `LoadMany`, and expired cleanup uses `DeleteIfMatches` so cleanup does not remove a refreshed cooldown pair.

Endpoint updates replace the endpoint object in config/runtime snapshots instead of mutating the old endpoint pointer in place.

### Success Criteria

- `PickEndpoint()` uses runtime healthy endpoint snapshots.
- `GetEndpointByID()` uses O(1) healthy endpoint lookup.
- Built-in load balancers can consume pre-filtered healthy endpoint views.
- Health-check transitions update runtime snapshots instead of mutating endpoint config in runtime clusters.
- Stale health events from old endpoint addresses are ignored.
- Snapshot refresh cannot overwrite newer concurrent health updates.
- Rebuilt runtime clusters inherit old health/runtime state before replacement health checks start.
- Runtime endpoint state is preserved only for the same endpoint ID and address.
- LLM fallback cooldown no longer writes to `Endpoint.Metadata`.
- LLM fallback does not route to snapshot-unhealthy fallback endpoints.
- Snapshot endpoint slices cannot be mutated by callers to corrupt published snapshots.
- Legacy `LoadBalancer` implementations remain source-compatible.
- Direct `CreateHealthCheck()` callers keep legacy `Endpoint.UnHealthy` mutation behavior.

### Benchmark Evidence

Environment:

- `goos=darwin`
- `goarch=arm64`
- `cpu=Apple M5`

The full Issue #905 baseline was established earlier to guide the whole optimization sequence. Since Step 4 has already been merged into upstream, this PR uses a fresh Step 4-based baseline for Step 5 attribution.

Baseline source:

- `upstream/develop`
- commit `4ceeb323`
- Step 4 merged, Step 5 absent

Only healthy filtering and simple load-balancer hot-path benchmarks are used here. Lookup improvements belong to Step 4, and consistent-hash improvements belong to the final consistent-hash step.

Benchmark command:

```bash
go test ./pkg/server -run '^$' -bench 'BenchmarkCluster(LoadBalancerHotPathSerial|Healthy)' -benchmem -count=5
```

Each `Mean` value below is the arithmetic mean across 5 runs. `Range` shows the min-max span across the same 5 runs.

#### Simple LB Hot Path

| Benchmark | Before mean ns/op | Before range | After mean ns/op | After range | B/op before -> after | allocs/op before -> after |
|---|---:|---:|---:|---:|---:|---:|
| Rand / `endpoints=4` | 13.44 | 13.25-13.79 | 25.39 | 25.11-25.64 | 0 -> 32 | 0 -> 1 |
| Rand / `endpoints=64` | 133.30 | 131.40-134.70 | 92.90 | 92.30-93.32 | 512 -> 512 | 1 -> 1 |
| Rand / `endpoints=512` | 903.06 | 880.90-916.20 | 600.92 | 596.60-603.20 | 4864 -> 4864 | 1 -> 1 |
| RoundRobin / `endpoints=4` | 10.07 | 9.75-10.39 | 22.15 | 21.66-22.57 | 0 -> 32 | 0 -> 1 |
| RoundRobin / `endpoints=64` | 122.64 | 122.20-123.20 | 90.95 | 90.46-91.71 | 512 -> 512 | 1 -> 1 |
| RoundRobin / `endpoints=512` | 838.56 | 835.30-844.40 | 618.54 | 602.10-635.00 | 4864 -> 4864 | 1 -> 1 |

#### Healthy View

Before uses the old `ClusterConfig.GetEndpoint(true)` healthy filter benchmark: `BenchmarkClusterHealthyFilterCost`.

After uses the runtime snapshot healthy view benchmark: `BenchmarkClusterHealthySnapshotLoad`.

| Benchmark | Before mean ns/op | Before range | After mean ns/op | After range | B/op before -> after | allocs/op before -> after |
|---|---:|---:|---:|---:|---:|---:|
| `endpoints=8, healthy=100%` | 21.94 | 21.74-22.19 | 17.69 | 17.58-17.85 | 64 -> 64 | 1 -> 1 |
| `endpoints=8, healthy=50%` | 19.94 | 19.86-20.05 | 13.92 | 13.74-14.05 | 64 -> 32 | 1 -> 1 |
| `endpoints=8, healthy=0%` | 16.96 | 16.90-17.07 | 2.91 | 2.87-2.96 | 64 -> 0 | 1 -> 0 |
| `endpoints=64, healthy=100%` | 121.04 | 115.60-124.20 | 86.16 | 84.52-87.05 | 512 -> 512 | 1 -> 1 |
| `endpoints=64, healthy=50%` | 104.74 | 103.60-105.40 | 45.59 | 45.50-45.67 | 512 -> 256 | 1 -> 1 |
| `endpoints=64, healthy=0%` | 84.77 | 84.16-85.03 | 2.97 | 2.94-3.04 | 512 -> 0 | 1 -> 0 |
| `endpoints=512, healthy=100%` | 907.90 | 875.50-939.10 | 603.16 | 592.50-609.90 | 4864 -> 4864 | 1 -> 1 |
| `endpoints=512, healthy=50%` | 726.64 | 693.90-757.50 | 312.72 | 310.90-313.90 | 4864 -> 2304 | 1 -> 1 |
| `endpoints=512, healthy=0%` | 484.54 | 483.20-486.40 | 2.98 | 2.93-3.02 | 4864 -> 0 | 1 -> 0 |

This PR removes the old request-path dependency on reading mutable endpoint health from `Endpoint.UnHealthy` and repeatedly filtering through `ClusterConfig.GetEndpoint(true)`.

Current snapshot accessors intentionally return defensive shallow copies to protect published snapshot immutability. Because of that, this PR does not claim zero-allocation simple load-balancer hot-path results. The small `endpoints=4` simple LB path is slower in this revision due to the defensive copy; optimizing that path remains Step 6.

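To make the defensive shallow copy concrete, here is a small sketch of what such an accessor might look like. The `snapshot` type and its `Healthy()` method are illustrative assumptions, not the accessors added by this PR. The sketch shows that the copy protects the published slice from caller writes, while the `*Endpoint` values themselves remain shared.

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for the project's endpoint type.
type Endpoint struct {
	ID      string
	Address string
}

// snapshot is a hypothetical published view; its slice is never written after publication.
type snapshot struct {
	healthy []*Endpoint
}

// Healthy returns a shallow copy of the healthy view. It is nil-safe and
// returns nil without allocating when there is nothing to copy.
func (s *snapshot) Healthy() []*Endpoint {
	if s == nil || len(s.healthy) == 0 {
		return nil
	}
	out := make([]*Endpoint, len(s.healthy)) // one allocation per call
	copy(out, s.healthy)
	return out
}

func main() {
	a := &Endpoint{ID: "a", Address: "10.0.0.1:80"}
	b := &Endpoint{ID: "b", Address: "10.0.0.2:80"}
	s := &snapshot{healthy: []*Endpoint{a, b}}

	view := s.Healthy()
	view[0] = nil // caller-side write only touches the copy

	fmt.Println(s.healthy[0] == a)   // the published slice still holds endpoint a
	fmt.Println(s.Healthy()[1] == b) // the pointers themselves are shared, not cloned
}
```

Because only the slice is copied, a write through a shared `*Endpoint` would still be visible to every reader, which is why mutable request-path data belongs in runtime state rather than in endpoint fields.
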
### Tests

Targeted coverage includes:

- snapshot seeding, defensive slices, and nil-safe accessors
- runtime health updates without mutating `Endpoint.UnHealthy`
- stale health events from replaced endpoint addresses
- refresh versus concurrent health update ordering
- runtime replacement snapshot inheritance and health-event freezing
- runtime state inheritance/drop behavior across same-address, moved-address, and deleted endpoints
- LLM cooldown state stored outside `Endpoint.Metadata`
- pair-oriented cooldown read/write/cleanup behavior
- `PickEndpoint()` and `GetEndpointByID()` reading healthy snapshots
- degraded multi-endpoint load balancing with only one healthy endpoint
- `PickNextEndpoint()` skipping snapshot-unhealthy fallback endpoints
- legacy load-balancer healthy endpoint adaptation, cursor reconciliation, and serialized compatibility path
- direct `CreateHealthCheck()` compatibility without a callback
- race coverage for concurrent endpoint picks and health updates

Validated with:

```bash
go test -count=1 ./pkg/cluster ./pkg/cluster/loadbalancer ./pkg/filter/llm/proxy ./pkg/server
go test -race -count=1 ./pkg/cluster ./pkg/cluster/loadbalancer ./pkg/filter/llm/proxy ./pkg/server
git diff --check
```

### Notes

`Endpoint.UnHealthy` remains for config compatibility and as the seed state when snapshots are built.

`EndpointRuntimeState` is runtime-only and is not serialized into endpoint config.

Snapshot endpoint slices are immutable after publication, but endpoint pointers remain shared for source compatibility. Request-path mutable state must stay in runtime state, not endpoint fields or metadata.

Out-of-tree load balancers remain source-compatible through the existing `LoadBalancer` interface. New balancers that want the current healthy endpoint view should implement `SnapshotLoadBalancer`.

The legacy load-balancer mutex only applies to old `LoadBalancer` implementations. Snapshot-aware built-in balancers use the snapshot path directly.

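The split between snapshot-aware and legacy balancers can be pictured as an optional-interface check. The sketch below is illustrative only: the method names `Pick` and `PickFromHealthy`, the `picker` type, and both example balancers are assumptions, and the real `LoadBalancer` and `SnapshotLoadBalancer` signatures in the repository may differ. It also simplifies the adapter by passing a copied slice rather than a shallow copied cluster config.

```go
package main

import (
	"fmt"
	"sync"
)

// Endpoint is a simplified stand-in for the project's endpoint type.
type Endpoint struct{ ID string }

// LoadBalancer stands in for the legacy interface (hypothetical signature).
type LoadBalancer interface {
	Pick(endpoints []*Endpoint) *Endpoint
}

// SnapshotLoadBalancer is the optional extension (hypothetical signature).
type SnapshotLoadBalancer interface {
	PickFromHealthy(healthy []*Endpoint) *Endpoint
}

// picker dispatches to the snapshot path when available, else to the adapter.
type picker struct {
	legacyMu sync.Mutex // serializes legacy balancers only
}

func (p *picker) pick(lb LoadBalancer, healthy []*Endpoint) *Endpoint {
	if len(healthy) == 0 {
		return nil
	}
	// Snapshot-aware balancers read the healthy view directly, with no lock.
	if s, ok := lb.(SnapshotLoadBalancer); ok {
		return s.PickFromHealthy(healthy)
	}
	// Legacy balancers get a private copy and run one at a time.
	p.legacyMu.Lock()
	defer p.legacyMu.Unlock()
	return lb.Pick(append([]*Endpoint(nil), healthy...))
}

// firstLegacy implements only the legacy interface.
type firstLegacy struct{}

func (firstLegacy) Pick(endpoints []*Endpoint) *Endpoint { return endpoints[0] }

// lastSnapshot implements both interfaces; the snapshot path is preferred.
type lastSnapshot struct{}

func (lastSnapshot) Pick(endpoints []*Endpoint) *Endpoint { return endpoints[0] }

func (lastSnapshot) PickFromHealthy(healthy []*Endpoint) *Endpoint {
	return healthy[len(healthy)-1]
}

func main() {
	healthy := []*Endpoint{{ID: "a"}, {ID: "b"}}
	p := &picker{}
	fmt.Println(p.pick(firstLegacy{}, healthy).ID)  // legacy adapter path
	fmt.Println(p.pick(lastSnapshot{}, healthy).ID) // snapshot path
}
```

A type assertion keeps the exported interface unchanged, so existing out-of-tree balancers keep compiling and simply take the serialized adapter path.
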
