mochengqian opened a new pull request, #962: URL: https://github.com/apache/dubbo-go-pixiu/pull/962
## What

Adds observability for endpoint snapshot publication (follow-up to #932). Operators currently have no direct signal for snapshot publish frequency or snapshot endpoint size, which makes registry churn, unexpected cluster size, or excessive health-update publication hard to diagnose.

## Metrics

Three OpenTelemetry instruments, emitted on each successful snapshot `CompareAndSwap`, labeled only by `cluster`:

| Metric | Type | Meaning |
|--------|------|---------|
| `pixiu_cluster_snapshot_publish_total` | counter | Incremented once per successful snapshot publish |
| `pixiu_cluster_snapshot_endpoint_count` | gauge | Total endpoints in the latest published snapshot |
| `pixiu_cluster_snapshot_healthy_endpoint_count` | gauge | Healthy endpoints in the latest published snapshot |

Instruments follow the existing `pixiu_`/snake_case convention and bind lazily to `otel.GetMeterProvider()` (same pattern as `pkg/filter/metric` and the LLM tokenizer), so they wire into the existing Prometheus exporter automatically.

## Where it emits

`recordSnapshotPublish` is called inside the successful-CAS branch of the three publish paths in `pkg/cluster/cluster.go`:

- `RefreshEndpointsFrom`: membership/address changes
- `UpdateEndpointHealth`: endpoint-keyed health flips
- `UpdateEndpointAddressHealth`: address-keyed health flips

The counter increments **only** on a real swap: no-op health updates return before CAS and are not counted, which is what makes "excessive health update publication" diagnosable.

## Design notes / trade-offs

- **Synchronous emission, not an `ObservableGauge`.** A callback gauge would require a global live-cluster registry to enumerate clusters at scrape time; this PR avoids introducing new global mutable state and emits at the publish site instead.
- **Stale series on cluster deletion.** Because emission is synchronous, a deleted cluster's gauge retains its last value until the process restarts. Clusters are long-lived configuration objects with rare name churn, so this is an accepted trade-off rather than a leak.
- **One attribute set allocated per publish** (`WithAttributes`). Publish is a low-frequency path (not the pick hot path), so this is not cached; the read/pick path is untouched.
- **Scope held deliberately tight:** no duration histogram, generation gauge, or per-state breakdown; those can follow as separate issues.

## Out of scope (per #944)

No new metrics backend, no endpoint-level (high-cardinality) labels, and no change to snapshot publication behavior.

## Tests

`pkg/cluster/metrics_test.go` uses a `ManualReader` to assert:

- publish counter and endpoint/healthy gauges after an endpoint-health flip
- no-op health update does **not** increment the counter
- address-keyed health flip publishes once and marks all endpoints sharing the address

```bash
go test ./pkg/cluster/... ./pkg/server/...
go vet ./pkg/cluster/
```

All pass; `gofmt` clean.

Closes #944
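
For readers who want a concrete picture of the "count only on a real swap" rule from the "Where it emits" section, the following is a minimal, standard-library-only sketch. It is not the code in `pkg/cluster/cluster.go`: the `snapshot`, `metrics`, and `cluster` types, the `newCluster` helper, and the lowercase `updateEndpointHealth` method are hypothetical, and plain atomic integers stand in for the three OpenTelemetry instruments. The sketch also assumes the initial snapshot is stored without being counted as a publish.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is a hypothetical immutable view of a cluster's endpoints,
// mapping an endpoint ID to its health state.
type snapshot struct {
	healthy map[string]bool
}

// metrics uses atomic integers as stand-ins for the counter and the two
// gauges described in the PR (assumption: the real code uses OTel instruments).
type metrics struct {
	publishTotal  atomic.Int64
	endpointCount atomic.Int64
	healthyCount  atomic.Int64
}

// cluster holds the current snapshot behind an atomic pointer.
type cluster struct {
	current atomic.Pointer[snapshot]
	m       metrics
}

// newCluster stores the initial snapshot without recording a publish.
func newCluster(endpoints map[string]bool) *cluster {
	c := &cluster{}
	c.current.Store(&snapshot{healthy: endpoints})
	return c
}

// recordSnapshotPublish is called only after a successful swap.
func (c *cluster) recordSnapshotPublish(s *snapshot) {
	var healthy int64
	for _, ok := range s.healthy {
		if ok {
			healthy++
		}
	}
	c.m.publishTotal.Add(1)
	c.m.endpointCount.Store(int64(len(s.healthy)))
	c.m.healthyCount.Store(healthy)
}

// updateEndpointHealth returns true if a new snapshot was published.
func (c *cluster) updateEndpointHealth(id string, healthy bool) bool {
	for {
		old := c.current.Load()
		prev, known := old.healthy[id]
		if !known || prev == healthy {
			return false // no-op: return before CAS, nothing is counted
		}
		// Copy-on-write: build the next snapshot from the old one.
		next := &snapshot{healthy: make(map[string]bool, len(old.healthy))}
		for k, v := range old.healthy {
			next.healthy[k] = v
		}
		next.healthy[id] = healthy
		if c.current.CompareAndSwap(old, next) {
			c.recordSnapshotPublish(next) // emit at the publish site
			return true
		}
		// Lost the race to another writer; reload and retry.
	}
}

func main() {
	c := newCluster(map[string]bool{"a": true, "b": true})
	c.updateEndpointHealth("a", false) // real flip: published
	c.updateEndpointHealth("a", false) // no-op: not published
	fmt.Printf("publishes=%d endpoints=%d healthy=%d\n",
		c.m.publishTotal.Load(), c.m.endpointCount.Load(), c.m.healthyCount.Load())
	// prints "publishes=1 endpoints=2 healthy=1"
}
```

Because the no-op check runs before the swap is attempted, a stream of redundant health reports leaves the publish count unchanged in this sketch, so a rising count reflects actual snapshot replacement.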
