Alanxtl opened a new issue, #3337:
URL: https://github.com/apache/dubbo-go/issues/3337
## Background

Dubbo-go already has a working metrics pipeline based on the metrics event bus, the Prometheus registry/exporter, RPC metrics, registry metrics, metadata metrics, config-center metrics, and application info metrics. The next step is to make these metrics more complete, more stable, and easier to consume in production dashboards and alerts.

## Goals

Improve metrics coverage and standardization so users can rely on a stable RED-style observability model across provider, consumer, registry, metadata, and config-center scenarios.

## Proposed Scope

### 1. Standardize metric names and labels

- Define a documented label contract for built-in metrics.
- Clarify stable labels such as `side`, `protocol`, `interface`, `method`, `group`, `version`, `error_code`, and `error_type`.
- Review high-cardinality labels and avoid exposing unstable values by default.
- Ensure provider and consumer metrics use symmetric naming where possible.

### 2. Complete RPC error classification

The current RPC metrics already include granular counters for timeout, limit, service unavailable, business failure, and unknown failure. This can be improved further:

- Extend error classification beyond Triple/gRPC error codes where possible.
- Add coverage for Dubbo protocol errors.
- Distinguish network failures from codec/serialization failures when the runtime exposes enough information.
- Ensure the same error taxonomy is reusable by tracing and logging.

### 3. Clarify RED metrics support

Provide a clear out-of-the-box model for:

- Rate: request QPS and total request counters.
- Errors: total failures and categorized failures.
- Duration: RT, aggregated RT, and quantile metrics.

This should be reflected in code comments, docs, and sample dashboards.

### 4. Improve component-level metrics layering

Review registry, metadata, and config-center metrics and classify them into:

- Basic metrics, enabled by default whenever metrics are enabled.
- Detailed metrics, disabled by default to reduce noise and enabled explicitly when needed.
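A minimal Go sketch of how the basic/detailed split could be expressed, assuming a hypothetical `MetricsConfig` with an opt-in `EnableDetailed` switch. `MetricLevel`, `MetricsConfig`, and `enabledMetrics` are illustrative names, not existing dubbo-go types, and the metric names are placeholders drawn from the registry, metadata, and config-center areas named in this item rather than dubbo-go's actual metric names.

```go
package main

import (
	"fmt"
	"sort"
)

// MetricLevel is a hypothetical layer classification for built-in metrics.
type MetricLevel int

const (
	// LevelBasic metrics are emitted whenever metrics are enabled.
	LevelBasic MetricLevel = iota
	// LevelDetailed metrics are emitted only when explicitly opted in.
	LevelDetailed
)

// metricSpec pairs a metric name with the layer it belongs to.
type metricSpec struct {
	Name  string
	Level MetricLevel
}

// catalog uses placeholder names; the real names are still to be decided.
var catalog = []metricSpec{
	{"registry_register_total", LevelBasic},
	{"registry_subscribe_total", LevelBasic},
	{"registry_notify_total", LevelDetailed},
	{"registry_directory_num", LevelDetailed},
	{"metadata_push_total", LevelBasic},
	{"metadata_subscribe_total", LevelBasic},
	{"metadata_store_provider_total", LevelDetailed},
	{"configcenter_change_total", LevelBasic},
}

// MetricsConfig is a hypothetical configuration shape.
type MetricsConfig struct {
	Enable         bool // master switch for all metrics
	EnableDetailed bool // opt-in switch for noisier, detailed metrics
}

// enabledMetrics returns the sorted metric names to register for cfg,
// or nil when metrics are disabled entirely.
func enabledMetrics(cfg MetricsConfig) []string {
	if !cfg.Enable {
		return nil
	}
	var names []string
	for _, m := range catalog {
		if m.Level == LevelBasic || cfg.EnableDetailed {
			names = append(names, m.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	basic := enabledMetrics(MetricsConfig{Enable: true})
	all := enabledMetrics(MetricsConfig{Enable: true, EnableDetailed: true})
	fmt.Println(len(basic), len(all))
}
```

Keeping the level next to each metric definition means the documented layering and the registration logic come from the same table, so they are less likely to drift apart.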
Candidate areas:

- Registry register/subscribe/notify/directory metrics.
- Metadata push/subscribe/store metrics.
- Config-center change metrics.

### 5. Align samples and dashboard queries

Update or verify the Prometheus/Grafana sample to use the standardized names and labels:

- `dubbo-go-samples/metrics/prometheus_grafana`
- Grafana panels for QPS, success rate, error rate, P99 latency, timeout rate, limit rate, and service unavailable rate.
- Prometheus query examples in docs.

## Acceptance Criteria

- Built-in metrics have documented names, labels, and cardinality guidance.
- RPC failure metrics have a consistent taxonomy shared across provider and consumer sides.
- Existing Prometheus/Grafana sample dashboards continue to work or are updated with the new metric contract.
- Tests cover the standardized metric names/labels and error classification behavior.
- Backward compatibility impact is documented if any metric is renamed or deprecated.

## Related Context

- Existing metrics implementation: `metrics/*`, `filter/metrics`, `metrics/prometheus`
- Existing sample: https://github.com/apache/dubbo-go-samples/tree/main/metrics/prometheus_grafana
