mochengqian opened a new pull request, #930:
URL: https://github.com/apache/dubbo-go-pixiu/pull/930
❗️Breaking change
### Description
Replaces the random-UUID endpoint identity used by the Nacos LLM registry
and the static cluster-config bootstrap with a deterministic SHA-256-based
scheme, so `endpoint.ID` is stable across process restarts and Nacos
re-subscriptions. Two related bugs in the same Nacos function — `Port`
being overwritten with 0 when `metadata["port"]` fails to parse, and
`Address` not falling back to `instance.Ip` / `instance.Port` — are fixed
along the way.
### Why
`endpoint.ID` is the index key for many runtime and observability paths.
With a random-UUID fallback, every process restart (and every Nacos
re-subscription) produces a fresh ID for the same physical endpoint:
| Path | What
happens after restart |
| ------------------------------------------------------------------- |
-----------------------------------------------------------------------------------
|
| Prometheus metrics with `endpoint` label |
label values change; each endpoint becomes multiple broken lines in Grafana
|
| Logs correlated by `endpoint.ID` | pre-
and post-restart logs no longer link, SRE loses the thread on every restart |
| `ClusterManager.PickNextEndpoint(curEndpointID)` | uses
`endpoint.ID` as a retry cursor — every in-flight retry chain drops on restart |
| LLM health-check / circuit-breaker state (indexed by `endpoint.ID`) |
treated as a brand-new endpoint, health and breaker state reset to zero
|
This is the kind of bug that doesn't surface in functional tests
(service runs, requests succeed) but bleeds operationally — operators
see endpoints "mysteriously disappearing and reappearing" in dashboards
and end up patching by manually setting `metadata["id"]` on every Nacos
instance. That patches the symptom. This PR removes the cause.
### Root Cause
The relevant `uuid.GenerateUUID()` fallbacks (pre-PR):
```go
// pkg/adapter/llmregistry/registry/nacos/listener.go (before)
if id, ok := instance.Metadata["id"]; ok {
ret.ID = id
} else {
ret.ID, _ = uuid.GenerateUUID() // fresh random UUID every restart
}
```
```go
// pkg/server/cluster_manager.go (before)
if endpoint.ID == "" {
endpoint.ID, _ = uuid.GenerateUUID() // same problem in the
static-config path
}
```
Neither call site consulted `instance.InstanceId` (Nacos already
provides a stable identifier), and neither used any content-addressable
hash of the endpoint's defining attributes. In the Nacos path the
problem also fires on every re-subscription, not only on restart,
because each callback re-derives `endpoint.ID` from scratch.
Two unrelated bugs sit in the same `generateEndpoint` block:
- `metadata["port"]` parse failure logs a warning but still writes the
parser's zero result into `Address.Port`, silently zeroing a valid port.
- `Address.Address` / `Address.Port` are populated only from
`metadata["ip"]` / `metadata["port"]`, so instances that publish only
`instance.Ip` / `instance.Port` (no metadata overrides) produce
endpoints with an empty address.
### PR Goal
Make endpoint identity a function of the endpoint's defining attributes
— cluster name, address, provider, API key — instead of a function of
when the process happened to start. Keep behavior consistent between the
Nacos LLM registry path and the static cluster-config bootstrap path so
operators see the same identity rules regardless of registry source.
### Success Criteria
- A Nacos instance carrying `metadata["id"]` continues to use that ID.
- A Nacos instance carrying `InstanceId` but no `metadata["id"]` uses
`InstanceId` (one upgrade-time shift, then stable forever).
- A Nacos instance carrying neither produces `generated-<sha8>` derived
from `(metadata["cluster"], address, provider, api_key)`, identical
across restarts and re-subscriptions.
- A static-config endpoint with empty `id:` produces the same
deterministic identifier on every boot.
- Renaming an endpoint (`metadata["name"]`) does not change its ID.
- Two endpoints differing only by credential do not collide.
- The raw API key never appears in the generated ID (SHA-256 is one-way).
- `metadata["port"]` parse error preserves the baseline `instance.Port`.
- `Address` baselines from `instance.Ip` / `instance.Port` before
metadata overrides.
### Upgrade-visible behavior change ⚠
| Source | Before | After
|
| ------------------------------------- | ---------------- |
------------------ |
| Nacos instance with `metadata["id"]` | `metadata["id"]` | unchanged
|
| Nacos instance with `InstanceId` only | random UUID | `InstanceId`
|
| Nacos instance without either | random UUID |
`generated-<sha8>` |
| Static config endpoint with empty ID | random UUID |
`generated-<sha8>` |
Operators must rebuild any dashboards, log queries, or runbooks keyed on
`endpoint.ID`. On the first deploy after upgrade, downstream consumers
of `OnAddEndpoint` / `OnRemoveEndpoint` will see one round of
remove-then-add per endpoint whose ID changes shape.
There is no runtime opt-out for this behavior change — the migration is
one-shot and reviewers should confirm this is acceptable for their
deployment before merging.
### Known limitations
- **Static endpoint collapse.** Two static endpoints with empty `id:`
that share the same `(cluster_name, address, provider, api_key)` now
collapse to a single identifier. They were already indistinguishable
to the picker; the previous random IDs offered no real separation.
`assembleClusterEndpoints` logs a `warn` naming both slice positions
and the shared ID when this happens, so the misconfiguration is
visible. Set an explicit `id:` to keep them distinct.
- **Nacos instances without `metadata["cluster"]`.** The cluster-name
component falls back to the empty string, so endpoints across
different Nacos services sharing `(address, provider, api_key)` will
collide. Set `metadata.cluster` on the Nacos side to prevent this.
### Paths this PR should affect
- `pkg/model/cluster.go`
- `pkg/model/cluster_test.go`
- `pkg/adapter/llmregistry/registry/nacos/listener.go`
- `pkg/adapter/llmregistry/registry/nacos/listener_test.go`
- `pkg/server/cluster_manager.go`
- `pkg/server/cluster_manager_test.go`
- `go.mod` (drops `hashicorp/go-uuid` from direct dependencies)
### Revise
- **Adding `model.GenerateEndpointID(clusterName, endpoint)`**: returns
`"generated-<8-byte-hex>"` where the 8 bytes are the leading bytes of
`sha256(clusterName ‖ address ‖ provider ‖ apiKey)`. Documented design
notes cover the 64-bit truncation rationale, the empty-clusterName
edge case, and the deliberate inclusion of the API key.
- **Changing Nacos `generateEndpoint`**: three-tier identity lookup —
`metadata["id"]` → `instance.InstanceId` → `GenerateEndpointID`. Stops
zeroing `Address.Port` on `metadata["port"]` parse failure. Seeds
`Address.Address` / `Address.Port` from `instance.Ip` /
`instance.Port` before metadata overrides apply.
- **Changing `assembleClusterEndpoints`**: empty `endpoint.ID` is filled
by `GenerateEndpointID(cluster.Name, endpoint)` instead of
`uuid.GenerateUUID()`. Updates the comment that incorrectly claimed
the ID was the endpoint index. Logs a `warn` (with both slice
positions and the shared ID) whenever two endpoints in the same
cluster end up with the same generated ID, so the documented collapse
behavior is loud rather than silent.
- **Hardening `endpointIDMaterial` contract**: the helper now carries a
comment stating it MUST NOT depend on `endpoint.Name`, because two
call sites rely on that — `assembleClusterEndpoints` derives the ID
before assigning a default Name, and the rename-invariance test
enforces it.
- **Adding regression tests**: deterministic-across-calls, rename
invariance, credential-difference distinction, address-difference
distinction, cluster-difference distinction, raw-key non-leakage,
Nacos `InstanceId` fallback, Nacos generated-hash fallback,
cross-rename ID stability, static-config endpoint collapse,
explicit-ID preservation.
- **Removing `hashicorp/go-uuid`** from direct dependencies in `go.mod`;
it remains transitively pulled in.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]