mochengqian opened a new issue, #958:
URL: https://github.com/apache/dubbo-go-pixiu/issues/958

   ## Background
   
   Follow-up to issue #905 and PR #932.
   
   PR #932 (issue #905 step 5) introduced `EndpointSnapshot` and completed 
**state isolation**: runtime health changes and membership updates now go 
through immutable snapshots published via CAS, and the request path no longer 
mutates `Endpoint.UnHealthy` or reads stale `Config.Endpoints`.
   
   However, **structural coupling** between configuration and runtime remains. 
`Cluster.Config` is still a public field; `ClusterStore.Config` and 
`clustersMap[name].Config` share the same pointer; and runtime state (e.g. 
`PrePickEndpointIndex`) lives on the configuration struct. PR #932 explicitly 
scoped these out:
   
   > Out of scope (tracked as tech debt): `Cluster.Config` remains public in 
this PR. Making it private + adding getters touches ~20 callsites and belongs 
in the decoupling series after PR-3 lands; target: decoupling PR-6, together 
with runtime ownership cleanup.
   
   This issue tracks that decoupling series.
   
   ## Current coupling points
   
   1. **`Cluster.Config` is a public field** (`pkg/cluster/cluster.go:40`) — 
external code can directly read/write `ClusterConfig`, bypassing any 
encapsulation or validation. No getter/setter boundary.
   
   2. **`ClusterStore.Config` and runtime `Cluster.Config` share the same 
pointer** — `replaceClusterRuntime` assigns `config` directly to the new 
cluster (`pkg/server/cluster_manager.go:448`). Mutations to `store.Config[i]` 
are visible to the runtime cluster, and vice versa. The identity check 
`runtimeCluster.Config != clusterConfig` (line 475) relies on pointer equality, 
which is fragile under deep-copy or reload scenarios.
   
   3. **RoundRobin cursor lives on `ClusterConfig.PrePickEndpointIndex`** — 
configuration structs represent desired state; runtime pick cursors are 
operational state. The two should not share storage. The current design forces 
legacy and snapshot RoundRobin paths to contend on the same atomic counter (see 
`pkg/cluster/loadbalancer/load_balancer.go` RR fairness caveat comment), 
creating a subtle fairness skew risk.
   
   4. **`prepareClusterConfig` eagerly rebuilds `Config.ConsistentHash.Hash`** 
on every endpoint mutation (`pkg/server/cluster_manager.go:362`), even though 
snapshot-aware balancers read `EndpointSnapshot.HealthyConsistentHash` and 
never touch `Config.ConsistentHash.Hash`. This is wasted work and should be 
deferred to the legacy pick path (tracked separately in #957, but the root 
cause is that config and runtime hash storage are conflated).
   
   5. **Ownership boundaries unclear** — when `ClusterStore` calls 
`replaceClusterRuntime`, the old runtime is returned and stopped by the caller. 
But there's no clear contract on who owns the `ClusterConfig` lifecycle, or 
when it's safe to mutate `Config.Endpoints` vs. when the runtime has published 
a snapshot from it.
   
   ## Goal
   
   Establish a clean separation:
   
   - **Configuration layer** (`ClusterStore.Config`, `ClusterConfig`) 
represents desired state. Immutable after publication to runtime. External 
mutations go through `ClusterManager` APIs that trigger controlled runtime 
updates.
   - **Runtime layer** (`clustersMap[name]`, `Cluster`, `EndpointSnapshot`) 
represents operational state. Holds its own copy (or read-only view) of config. 
Runtime state (cursors, health, snapshots) never leaks into config structs.
   - **Clear ownership** — who owns each object, when it's safe to mutate, and 
how lifecycle (start/stop/replace) is managed.
   
   ## Suggested approach (sequenced steps)
   
   This is a **series**, not a single PR. Each step should be independently 
reviewable and have clear before/after semantics.
   
   ### Step 1: Make `Cluster.Config` private + add getters
   
   - Rename `Cluster.Config` → `Cluster.config` (unexported).
   - Add `func (c *Cluster) Config() *model.ClusterConfig` that returns a 
**read-only view** (either a shallow copy with critical fields frozen, or a 
wrapper that panics on write — TBD during implementation).
   - Audit all ~20 external read sites and replace `cluster.Config.X` with 
`cluster.Config().X`.
   - For the few write sites (if any exist outside `ClusterManager`), route 
them through explicit `ClusterManager` mutation APIs or flag them as bugs.
   
   **Goal:** No external code can directly mutate `Cluster.config` after it's 
constructed.
   
   ### Step 2: `replaceClusterRuntime` deep-clones `ClusterConfig`
   
   - Instead of `s.clustersMap[name] = 
cluster.NewClusterWithEndpointSnapshot(config, previous)` (shared pointer), 
deep-clone `config` first:
     ```go
     runtimeConfig := model.CloneClusterConfig(config)  // new helper
     s.clustersMap[name] = 
cluster.NewClusterWithEndpointSnapshot(runtimeConfig, previous)
     ```
   - The runtime cluster now owns its config copy. Mutations to 
`store.Config[i]` no longer affect the running cluster until the next 
`replaceClusterRuntime`.
   - The identity check `runtimeCluster.Config != clusterConfig` becomes a 
generation/version check instead of pointer equality (consider adding 
`ClusterConfig.Generation int64` or similar).
   
   **Goal:** `store.Config` and `clustersMap[name].Config()` point to 
independent objects.
   
   ### Step 3: Move RoundRobin cursor out of `ClusterConfig`
   
   - Add a `RuntimeState` struct (or similar) that holds operational counters:
     ```go
     type RuntimeState struct {
         roundRobinCursor uint32
     }
     ```
   - `Cluster` holds `runtimeState *RuntimeState`.
   - `PickContext` carries `runtimeState` alongside `Config` and `Snapshot`.
   - RoundRobin reads/writes the cursor from `runtimeState`, not `Config`.
   - Legacy RoundRobin path and snapshot RoundRobin path now have **independent 
cursors** (or share one in `runtimeState` if that's the desired semantic — but 
either way, it's not on the config struct).
   
   **Goal:** `ClusterConfig` is pure desired-state data. No runtime operational 
fields.
   
   ### Step 4: Audit and document ownership boundaries
   
   - Write down the lifecycle contract:
     - Who creates `ClusterConfig`? (Answer: user yaml / xDS / 
`ClusterManager.AddCluster`)
     - When is it safe to mutate? (Answer: only before passing to `AddCluster` 
/ `UpdateCluster`, or inside `prepareClusterConfig` before runtime publication)
     - Who owns runtime `Cluster`? (Answer: `ClusterStore.clustersMap`, 
lifecycle managed by `replaceClusterRuntime` / `stopClusters`)
     - When is a `Cluster` stopped? (Answer: on `UpdateCluster` / 
`DeleteCluster` / `CompareAndSetStore`)
   - Add godoc on `ClusterConfig`, `Cluster`, `ClusterStore` summarizing the 
contract.
   - Add a test that tries to mutate `store.Config[i].Endpoints` after 
`AddCluster` and asserts the runtime cluster doesn't see it (proving the clone).
   
   **Goal:** Clear ownership rules documented and enforced by tests.
   
   ### Step 5 (optional stretch): Immutable `ClusterConfig` with builder pattern
   
   If steps 1-4 reveal lots of subtle mutation sites, consider:
   
   - `ClusterConfig` becomes append-only (all fields have setters that return a 
new copy).
   - Or use a builder pattern (`ClusterConfigBuilder`) that produces immutable 
`ClusterConfig` on `.Build()`.
   
   This is a bigger refactor and may not be worth it if steps 1-4 already 
establish clear boundaries. Defer unless mutation bugs keep surfacing.
   
   ## Acceptance criteria
   
   - `Cluster.Config` is unexported; external reads go through a getter that 
returns a read-only view.
   - `store.Config[i]` and `clustersMap[name].Config()` point to independent 
objects (verified by pointer inequality test).
   - RoundRobin cursor is in runtime state, not in `ClusterConfig`.
   - Ownership contract is documented in godoc and enforced by tests.
   - Mutating `store.Config[i]` after `AddCluster` / `UpdateCluster` has no 
effect on the running cluster until the next explicit update API call.
   
   ## Relationship to other follow-ups
   
   - **Independent of** #938-950, #955-957 (those are perf / API cleanup; this 
is structural)
   - **Complements** #957 (lazy `Config.ConsistentHash`) — once config and 
runtime are separate, it's clearer that `Config.ConsistentHash.Hash` is 
legacy-only
   
   ## Files
   
   - `pkg/cluster/cluster.go`
   - `pkg/server/cluster_manager.go`
   - `pkg/model/cluster.go`
   - `pkg/cluster/loadbalancer/roundrobin/round_robin.go` (step 3)
   - `pkg/cluster/loadbalancer/load_balancer.go` (step 3)
   
   ## Priority
   
   Medium. State isolation is already complete (#932), so this is not a 
correctness blocker. But it's foundational for:
   
   - Making the cluster runtime easier to reason about (clear mutation 
boundaries)
   - Enabling future work like dynamic reload without restart, or separating 
control-plane config from data-plane runtime
   - Reducing the risk of accidental config mutation bugs
   
   Suggest tackling this after the current perf follow-ups (#955-957) land, so 
the architecture is stable before the next round of optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to