mochengqian opened a new pull request, #930:
URL: https://github.com/apache/dubbo-go-pixiu/pull/930

   ❗️Breaking change
   ### Description
   Replaces the random-UUID endpoint identity used by the Nacos LLM registry
   and the static cluster-config bootstrap with a deterministic SHA-256-based
   scheme, so `endpoint.ID` is stable across process restarts and Nacos
   re-subscriptions. Two related bugs in the same Nacos function — `Port`
   being overwritten with 0 when `metadata["port"]` fails to parse, and
   `Address` not falling back to `instance.Ip` / `instance.Port` — are fixed
   along the way.
   
   ### Why
   
   `endpoint.ID` is the index key for many runtime and observability paths.
   With a random-UUID fallback, every process restart (and every Nacos
   re-subscription) produces a fresh ID for the same physical endpoint:
   
   | Path                                                                | What 
happens after restart                                                          |
   | ------------------------------------------------------------------- | 
-----------------------------------------------------------------------------------
 |
   | Prometheus metrics with `endpoint` label                            | 
label values change; each endpoint becomes multiple broken lines in Grafana     
    |
   | Logs correlated by `endpoint.ID`                                    | pre- 
and post-restart logs no longer link, SRE loses the thread on every restart    |
   | `ClusterManager.PickNextEndpoint(curEndpointID)`                    | uses 
`endpoint.ID` as a retry cursor — every in-flight retry chain drops on restart |
   | LLM health-check / circuit-breaker state (indexed by `endpoint.ID`) | 
treated as a brand-new endpoint, health and breaker state reset to zero         
    |
   
   This is the kind of bug that doesn't surface in functional tests
   (service runs, requests succeed) but bleeds operationally — operators
   see endpoints "mysteriously disappearing and reappearing" in dashboards
   and end up patching by manually setting `metadata["id"]` on every Nacos
   instance. That patches the symptom. This PR removes the cause.
   
   ### Root Cause
   
   The relevant `uuid.GenerateUUID()` fallbacks (pre-PR):
   
   ```go
   // pkg/adapter/llmregistry/registry/nacos/listener.go (before)
   if id, ok := instance.Metadata["id"]; ok {
       ret.ID = id
   } else {
       ret.ID, _ = uuid.GenerateUUID()   // fresh random UUID every restart
   }
   ```
   
   ```go
   // pkg/server/cluster_manager.go (before)
   if endpoint.ID == "" {
       endpoint.ID, _ = uuid.GenerateUUID()   // same problem in the 
static-config path
   }
   ```
   
   Neither call site consulted `instance.InstanceId` (Nacos already
   provides a stable identifier), and neither used any content-addressable
   hash of the endpoint's defining attributes. In the Nacos path the
   problem also fires on every re-subscription, not only on restart,
   because each callback re-derives `endpoint.ID` from scratch.
   
   Two unrelated bugs sit in the same `generateEndpoint` block:
   
   - `metadata["port"]` parse failure logs a warning but still writes the
     parser's zero result into `Address.Port`, silently zeroing a valid port.
   - `Address.Address` / `Address.Port` are populated only from
     `metadata["ip"]` / `metadata["port"]`, so instances that publish only
     `instance.Ip` / `instance.Port` (no metadata overrides) produce
     endpoints with an empty address.
   
   ### PR Goal
   
   Make endpoint identity a function of the endpoint's defining attributes
   — cluster name, address, provider, API key — instead of a function of
   when the process happened to start. Keep behavior consistent between the
   Nacos LLM registry path and the static cluster-config bootstrap path so
   operators see the same identity rules regardless of registry source.
   
   ### Success Criteria
   
   - A Nacos instance carrying `metadata["id"]` continues to use that ID.
   - A Nacos instance carrying `InstanceId` but no `metadata["id"]` uses
     `InstanceId` (one upgrade-time shift, then stable forever).
   - A Nacos instance carrying neither produces `generated-<sha8>` derived
     from `(metadata["cluster"], address, provider, api_key)`, identical
     across restarts and re-subscriptions.
   - A static-config endpoint with empty `id:` produces the same
     deterministic identifier on every boot.
   - Renaming an endpoint (`metadata["name"]`) does not change its ID.
   - Two endpoints differing only by credential do not collide.
   - The raw API key never appears in the generated ID (SHA-256 is one-way).
   - `metadata["port"]` parse error preserves the baseline `instance.Port`.
   - `Address` baselines from `instance.Ip` / `instance.Port` before
     metadata overrides.
   
   ### Upgrade-visible behavior change ⚠
   
   | Source                                | Before           | After           
   |
   | ------------------------------------- | ---------------- | 
------------------ |
   | Nacos instance with `metadata["id"]`  | `metadata["id"]` | unchanged       
   |
   | Nacos instance with `InstanceId` only | random UUID      | `InstanceId`    
   |
   | Nacos instance without either         | random UUID      | 
`generated-<sha8>` |
   | Static config endpoint with empty ID  | random UUID      | 
`generated-<sha8>` |
   
   Operators must rebuild any dashboards, log queries, or runbooks keyed on
   `endpoint.ID`. On the first deploy after upgrade, downstream consumers
   of `OnAddEndpoint` / `OnRemoveEndpoint` will see one round of
   remove-then-add per endpoint whose ID changes shape.
   
   There is no runtime opt-out for this behavior change — the migration is
   one-shot and reviewers should confirm this is acceptable for their
   deployment before merging.
   
   ### Known limitations
   
   - **Static endpoint collapse.** Two static endpoints with empty `id:`
     that share the same `(cluster_name, address, provider, api_key)` now
     collapse to a single identifier. They were already indistinguishable
     to the picker; the previous random IDs offered no real separation.
     `assembleClusterEndpoints` logs a `warn` naming both slice positions
     and the shared ID when this happens, so the misconfiguration is
     visible. Set an explicit `id:` to keep them distinct.
   - **Nacos instances without `metadata["cluster"]`.** The cluster-name
     component falls back to the empty string, so endpoints across
     different Nacos services sharing `(address, provider, api_key)` will
     collide. Set `metadata.cluster` on the Nacos side to prevent this.
   
   ### Paths this PR should affect
   
   - `pkg/model/cluster.go`
   - `pkg/model/cluster_test.go`
   - `pkg/adapter/llmregistry/registry/nacos/listener.go`
   - `pkg/adapter/llmregistry/registry/nacos/listener_test.go`
   - `pkg/server/cluster_manager.go`
   - `pkg/server/cluster_manager_test.go`
   - `go.mod` (drops `hashicorp/go-uuid` from direct dependencies)
   
   ### Revise
   
   - **Adding `model.GenerateEndpointID(clusterName, endpoint)`**: returns
     `"generated-<8-byte-hex>"` where the 8 bytes are the leading bytes of
     `sha256(clusterName ‖ address ‖ provider ‖ apiKey)`. Documented design
     notes cover the 64-bit truncation rationale, the empty-clusterName
     edge case, and the deliberate inclusion of the API key.
   - **Changing Nacos `generateEndpoint`**: three-tier identity lookup —
     `metadata["id"]` → `instance.InstanceId` → `GenerateEndpointID`. Stops
     zeroing `Address.Port` on `metadata["port"]` parse failure. Seeds
     `Address.Address` / `Address.Port` from `instance.Ip` /
     `instance.Port` before metadata overrides apply.
   - **Changing `assembleClusterEndpoints`**: empty `endpoint.ID` is filled
     by `GenerateEndpointID(cluster.Name, endpoint)` instead of
     `uuid.GenerateUUID()`. Updates the comment that incorrectly claimed
     the ID was the endpoint index. Logs a `warn` (with both slice
     positions and the shared ID) whenever two endpoints in the same
     cluster end up with the same generated ID, so the documented collapse
     behavior is loud rather than silent.
   - **Hardening `endpointIDMaterial` contract**: the helper now carries a
     comment stating it MUST NOT depend on `endpoint.Name`, because two
     call sites rely on that — `assembleClusterEndpoints` derives the ID
     before assigning a default Name, and the rename-invariance test
     enforces it.
   - **Adding regression tests**: deterministic-across-calls, rename
     invariance, credential-difference distinction, address-difference
     distinction, cluster-difference distinction, raw-key non-leakage,
     Nacos `InstanceId` fallback, Nacos generated-hash fallback,
     cross-rename ID stability, static-config endpoint collapse,
     explicit-ID preservation.
   - **Removing `hashicorp/go-uuid`** from direct dependencies in `go.mod`;
     it remains transitively pulled in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to