mochengqian opened a new pull request, #962:
URL: https://github.com/apache/dubbo-go-pixiu/pull/962

   ## What
   
   Adds observability for endpoint snapshot publication (follow-up to #932). 
Operators currently have no direct signal for how often snapshots are 
published or how many endpoints a snapshot contains, which makes registry 
churn, unexpected cluster size, or excessive health-update publication hard 
to diagnose.
   
   ## Metrics
   
   Three OpenTelemetry instruments, emitted on each successful snapshot 
`CompareAndSwap`, labeled only by `cluster`:
   
   | Metric | Type | Meaning |
   |--------|------|---------|
   | `pixiu_cluster_snapshot_publish_total` | counter | Incremented once per 
successful snapshot publish |
   | `pixiu_cluster_snapshot_endpoint_count` | gauge | Total endpoints in the 
latest published snapshot |
   | `pixiu_cluster_snapshot_healthy_endpoint_count` | gauge | Healthy 
endpoints in the latest published snapshot |
   
   Instruments follow the existing `pixiu_`/snake_case convention and bind 
lazily to `otel.GetMeterProvider()` (same pattern as `pkg/filter/metric` and 
the LLM tokenizer), so they wire into the existing Prometheus exporter 
automatically.
   
   ## Where it emits
   
   `recordSnapshotPublish` is called inside the successful-CAS branch of the 
three publish paths in `pkg/cluster/cluster.go`:
   
   - `RefreshEndpointsFrom` — membership/address changes
   - `UpdateEndpointHealth` — endpoint-keyed health flips
   - `UpdateEndpointAddressHealth` — address-keyed health flips
   
   The counter increments **only** on a real swap: no-op health updates return 
before CAS and are not counted, which is what makes "excessive health update 
publication" diagnosable.
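
The emit-on-successful-CAS behavior can be illustrated with a minimal, standard-library-only sketch. It is not the code in `pkg/cluster/cluster.go`: the `endpoint`, `snapshot`, and `cluster` types, the `updateHealth` helper, the retry loop, and the plain atomic integers standing in for the OpenTelemetry instruments are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// endpoint and snapshot are hypothetical stand-ins for the real cluster types.
type endpoint struct {
	ID      string
	Address string
	Healthy bool
}

type snapshot struct {
	endpoints []endpoint
}

// cluster keeps the current snapshot behind an atomic pointer. The three
// atomic integers stand in for the OpenTelemetry counter and gauges.
type cluster struct {
	current       atomic.Pointer[snapshot]
	publishTotal  atomic.Int64
	endpointCount atomic.Int64
	healthyCount  atomic.Int64
}

// recordSnapshotPublish is called only after a successful swap.
func (c *cluster) recordSnapshotPublish(s *snapshot) {
	var healthy int64
	for _, e := range s.endpoints {
		if e.Healthy {
			healthy++
		}
	}
	c.publishTotal.Add(1)
	c.endpointCount.Store(int64(len(s.endpoints)))
	c.healthyCount.Store(healthy)
}

// updateHealth sets the health of every endpoint accepted by match and
// reports whether a new snapshot was published.
func (c *cluster) updateHealth(match func(endpoint) bool, healthy bool) bool {
	for {
		old := c.current.Load()
		if old == nil {
			return false
		}
		// Copy the endpoints so the published snapshot is never mutated.
		next := &snapshot{endpoints: append([]endpoint(nil), old.endpoints...)}
		changed := false
		for i := range next.endpoints {
			e := &next.endpoints[i]
			if match(*e) && e.Healthy != healthy {
				e.Healthy = healthy
				changed = true
			}
		}
		if !changed {
			return false // no-op: return before the CAS, nothing is counted
		}
		if c.current.CompareAndSwap(old, next) {
			c.recordSnapshotPublish(next)
			return true
		}
		// Another writer published first; retry against the newer snapshot.
	}
}

func byID(id string) func(endpoint) bool {
	return func(e endpoint) bool { return e.ID == id }
}

func byAddress(addr string) func(endpoint) bool {
	return func(e endpoint) bool { return e.Address == addr }
}

// newDemoCluster seeds three healthy endpoints without counting a publish.
func newDemoCluster() *cluster {
	c := &cluster{}
	c.current.Store(&snapshot{endpoints: []endpoint{
		{ID: "a", Address: "10.0.0.1:80", Healthy: true},
		{ID: "b", Address: "10.0.0.2:80", Healthy: true},
		{ID: "c", Address: "10.0.0.2:80", Healthy: true},
	}})
	return c
}

func main() {
	c := newDemoCluster()
	c.updateHealth(byID("a"), false)                // real flip: one publish
	c.updateHealth(byID("a"), false)                // no-op: not counted
	c.updateHealth(byAddress("10.0.0.2:80"), false) // marks b and c, one publish
	fmt.Println(c.publishTotal.Load(), c.endpointCount.Load(), c.healthyCount.Load())
	// prints "2 3 0"
}
```

In this sketch the repeated flip of endpoint `a` leaves the counter untouched, while the address-keyed flip changes two endpoints but still counts as a single publish.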
   
   ## Design notes / trade-offs
   
   - **Synchronous emission, not an `ObservableGauge`.** A callback gauge would 
require a global live-cluster registry to enumerate clusters at scrape time; 
this PR avoids introducing new global mutable state and emits at the publish 
site instead.
   - **Stale series on cluster deletion.** Because emission is synchronous, a 
deleted cluster's gauge retains its last value until the process restarts. 
Clusters are long-lived configuration objects with rare name churn, so this is 
an accepted trade-off rather than a leak.
   - **One attribute set allocated per publish** (`WithAttributes`). Publish is 
a low-frequency path (not the pick hot path), so this is not cached; the 
read/pick path is untouched.
   - **Scope held deliberately tight:** no duration histogram, generation 
gauge, or per-state breakdown — those can follow as separate issues.
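
The stale-series trade-off noted above can be modeled with a small sketch. It assumes a synchronously recorded gauge behaves like last-value-per-label storage with no deletion hook; `lastValueGauge`, its methods, and the cluster names `orders` and `users` are hypothetical and not part of this PR or of the OpenTelemetry API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// lastValueGauge is a hypothetical model of a synchronously recorded gauge:
// it remembers the last value per cluster label and has no notion of deletion.
type lastValueGauge struct {
	mu   sync.Mutex
	last map[string]int64
}

func newLastValueGauge() *lastValueGauge {
	return &lastValueGauge{last: make(map[string]int64)}
}

// Record is what the publish site would call after a successful swap.
func (g *lastValueGauge) Record(cluster string, v int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.last[cluster] = v
}

// Collect models a scrape: every label ever recorded is reported, sorted by name.
func (g *lastValueGauge) Collect() []string {
	g.mu.Lock()
	defer g.mu.Unlock()
	out := make([]string, 0, len(g.last))
	for name, v := range g.last {
		out = append(out, fmt.Sprintf("%s=%d", name, v))
	}
	sort.Strings(out)
	return out
}

func main() {
	g := newLastValueGauge()
	live := map[string]bool{"orders": true, "users": true}
	g.Record("orders", 4)
	g.Record("users", 7)
	delete(live, "users") // the cluster is removed; nothing tells the gauge
	fmt.Println(len(live), g.Collect())
	// prints "1 [orders=4 users=7]"
}
```

A callback-based gauge would avoid the leftover `users` series, but only by consulting a registry of live clusters at scrape time, which is the global state this PR chooses not to introduce.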
   
   ## Out of scope (per #944)
   
   No new metrics backend, no endpoint-level (high-cardinality) labels, and no 
change to snapshot publication behavior.
   
   ## Tests
   
   `pkg/cluster/metrics_test.go` uses a `ManualReader` to assert:
   
   - publish counter and endpoint/healthy gauges after an endpoint-health flip
   - no-op health update does **not** increment the counter
   - address-keyed health flip publishes once and marks all endpoints sharing 
the address
   
   ```bash
   go test ./pkg/cluster/... ./pkg/server/...
   go vet ./pkg/cluster/
   ```
   
   All pass; `gofmt` clean.
   
   Closes #944


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

