mochengqian opened a new issue, #958:
URL: https://github.com/apache/dubbo-go-pixiu/issues/958
## Background
Follow-up to issue #905 and PR #932.
PR #932 (issue #905 step 5) introduced `EndpointSnapshot` and completed
**state isolation**: runtime health changes and membership updates now go
through immutable snapshots published via CAS, and the request path no longer
mutates `Endpoint.UnHealthy` or reads stale `Config.Endpoints`.
However, **structural coupling** between configuration and runtime remains.
`Cluster.Config` is still a public field; `ClusterStore.Config` and
`clustersMap[name].Config` share the same pointer; and runtime state (e.g.
`PrePickEndpointIndex`) lives on the configuration struct. PR #932 explicitly
scoped these out:
> Out of scope (tracked as tech debt): `Cluster.Config` remains public in
this PR. Making it private + adding getters touches ~20 callsites and belongs
in the decoupling series after PR-3 lands; target: decoupling PR-6, together
with runtime ownership cleanup.
This issue tracks that decoupling series.
## Current coupling points
1. **`Cluster.Config` is a public field** (`pkg/cluster/cluster.go:40`) —
external code can directly read/write `ClusterConfig`, bypassing any
encapsulation or validation. No getter/setter boundary.
2. **`ClusterStore.Config` and runtime `Cluster.Config` share the same
pointer** — `replaceClusterRuntime` assigns `config` directly to the new
cluster (`pkg/server/cluster_manager.go:448`). Mutations to `store.Config[i]`
are visible to the runtime cluster, and vice versa. The identity check
`runtimeCluster.Config != clusterConfig` (line 475) relies on pointer equality,
which is fragile under deep-copy or reload scenarios.
3. **RoundRobin cursor lives on `ClusterConfig.PrePickEndpointIndex`** —
configuration structs represent desired state; runtime pick cursors are
operational state. The two should not share storage. The current design forces
legacy and snapshot RoundRobin paths to contend on the same atomic counter (see
`pkg/cluster/loadbalancer/load_balancer.go` RR fairness caveat comment),
creating a subtle fairness skew risk.
4. **`prepareClusterConfig` eagerly rebuilds `Config.ConsistentHash.Hash`**
on every endpoint mutation (`pkg/server/cluster_manager.go:362`), even though
snapshot-aware balancers read `EndpointSnapshot.HealthyConsistentHash` and
never touch `Config.ConsistentHash.Hash`. This is wasted work and should be
deferred to the legacy pick path (tracked separately in #957, but the root
cause is that config and runtime hash storage are conflated).
5. **Ownership boundaries unclear** — when `ClusterStore` calls
`replaceClusterRuntime`, the old runtime is returned and stopped by the caller.
But there's no clear contract on who owns the `ClusterConfig` lifecycle, or
when it's safe to mutate `Config.Endpoints` vs. when the runtime has published
a snapshot from it.
## Goal
Establish a clean separation:
- **Configuration layer** (`ClusterStore.Config`, `ClusterConfig`)
represents desired state. Immutable after publication to runtime. External
mutations go through `ClusterManager` APIs that trigger controlled runtime
updates.
- **Runtime layer** (`clustersMap[name]`, `Cluster`, `EndpointSnapshot`)
represents operational state. Holds its own copy (or read-only view) of config.
Runtime state (cursors, health, snapshots) never leaks into config structs.
- **Clear ownership** — who owns each object, when it's safe to mutate, and
how lifecycle (start/stop/replace) is managed.
## Suggested approach (sequenced steps)
This is a **series**, not a single PR. Each step should be independently
reviewable and have clear before/after semantics.
### Step 1: Make `Cluster.Config` private + add getters
- Rename `Cluster.Config` → `Cluster.config` (unexported).
- Add `func (c *Cluster) Config() *model.ClusterConfig` that returns a
**read-only view** (either a shallow copy with critical fields frozen, or a
wrapper that panics on write — TBD during implementation).
- Audit all ~20 external read sites and replace `cluster.Config.X` with
`cluster.Config().X`.
- For the few write sites (if any exist outside `ClusterManager`), route
them through explicit `ClusterManager` mutation APIs or flag them as bugs.
**Goal:** No external code can directly mutate `Cluster.config` after it's
constructed.
### Step 2: `replaceClusterRuntime` deep-clones `ClusterConfig`
- Instead of `s.clustersMap[name] =
cluster.NewClusterWithEndpointSnapshot(config, previous)` (shared pointer),
deep-clone `config` first:
```go
runtimeConfig := model.CloneClusterConfig(config) // new helper
s.clustersMap[name] =
cluster.NewClusterWithEndpointSnapshot(runtimeConfig, previous)
```
- The runtime cluster now owns its config copy. Mutations to
`store.Config[i]` no longer affect the running cluster until the next
`replaceClusterRuntime`.
- The identity check `runtimeCluster.Config != clusterConfig` becomes a
generation/version check instead of pointer equality (consider adding
`ClusterConfig.Generation int64` or similar).
**Goal:** `store.Config` and `clustersMap[name].Config()` point to
independent objects.
### Step 3: Move RoundRobin cursor out of `ClusterConfig`
- Add a `RuntimeState` struct (or similar) that holds operational counters:
```go
type RuntimeState struct {
roundRobinCursor uint32
}
```
- `Cluster` holds `runtimeState *RuntimeState`.
- `PickContext` carries `runtimeState` alongside `Config` and `Snapshot`.
- RoundRobin reads/writes the cursor from `runtimeState`, not `Config`.
- Legacy RoundRobin path and snapshot RoundRobin path now have **independent
cursors** (or share one in `runtimeState` if that's the desired semantic — but
either way, it's not on the config struct).
**Goal:** `ClusterConfig` is pure desired-state data. No runtime operational
fields.
### Step 4: Audit and document ownership boundaries
- Write down the lifecycle contract:
- Who creates `ClusterConfig`? (Answer: user yaml / xDS /
`ClusterManager.AddCluster`)
- When is it safe to mutate? (Answer: only before passing to `AddCluster`
/ `UpdateCluster`, or inside `prepareClusterConfig` before runtime publication)
- Who owns runtime `Cluster`? (Answer: `ClusterStore.clustersMap`,
lifecycle managed by `replaceClusterRuntime` / `stopClusters`)
- When is a `Cluster` stopped? (Answer: on `UpdateCluster` /
`DeleteCluster` / `CompareAndSetStore`)
- Add godoc on `ClusterConfig`, `Cluster`, `ClusterStore` summarizing the
contract.
- Add a test that tries to mutate `store.Config[i].Endpoints` after
`AddCluster` and asserts the runtime cluster doesn't see it (proving the clone).
**Goal:** Clear ownership rules documented and enforced by tests.
### Step 5 (optional stretch): Immutable `ClusterConfig` with builder pattern
If steps 1-4 reveal lots of subtle mutation sites, consider:
- `ClusterConfig` becomes append-only (all fields have setters that return a
new copy).
- Or use a builder pattern (`ClusterConfigBuilder`) that produces immutable
`ClusterConfig` on `.Build()`.
This is a bigger refactor and may not be worth it if steps 1-4 already
establish clear boundaries. Defer unless mutation bugs keep surfacing.
## Acceptance criteria
- `Cluster.Config` is unexported; external reads go through a getter that
returns a read-only view.
- `store.Config[i]` and `clustersMap[name].Config()` point to independent
objects (verified by pointer inequality test).
- RoundRobin cursor is in runtime state, not in `ClusterConfig`.
- Ownership contract is documented in godoc and enforced by tests.
- Mutating `store.Config[i]` after `AddCluster` / `UpdateCluster` has no
effect on the running cluster until the next explicit update API call.
## Relationship to other follow-ups
- **Independent of** #938-950, #955-957 (those are perf / API cleanup; this
is structural)
- **Complements** #957 (lazy `Config.ConsistentHash`) — once config and
runtime are separate, it's clearer that `Config.ConsistentHash.Hash` is
legacy-only
## Files
- `pkg/cluster/cluster.go`
- `pkg/server/cluster_manager.go`
- `pkg/model/cluster.go`
- `pkg/cluster/loadbalancer/roundrobin/round_robin.go` (step 3)
- `pkg/cluster/loadbalancer/load_balancer.go` (step 3)
## Priority
Medium. State isolation is already complete (#932), so this is not a
correctness blocker. But it's foundational for:
- Making the cluster runtime easier to reason about (clear mutation
boundaries)
- Enabling future work like dynamic reload without restart, or separating
control-plane config from data-plane runtime
- Reducing the risk of accidental config mutation bugs
Suggest tackling this after the current perf follow-ups (#955-957) land, so
the architecture is stable before the next round of optimization.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]