hanahmily opened a new pull request, #1153: URL: https://github.com/apache/skywalking-banyandb/pull/1153
## Problem

`fodc-proxy` silently corrupts Prometheus metric **types**. On a live cluster the proxy `/metrics` exposed **0 counters and 0 summaries**: every counter was emitted as `gauge`, several `_count`-suffixed counters as `histogram`, and `go_gc_duration_seconds` with two conflicting `# TYPE` lines.

## Root cause

Type information was destroyed before the proxy could ever emit it, across the whole pipeline `banyandb /metrics → agent → gRPC StreamMetrics → proxy /metrics`:

1. **Agent parser** captured `# HELP` but dropped `# TYPE` entirely (`RawMetric` had no type field).
2. **Flight recorder** stored descriptions but no types.
3. **`StreamMetrics` proto** `Metric` had no type field.
4. **Proxy aggregator** `AggregatedMetric` had no type field.
5. **Proxy formatter** guessed the type from a name-suffix heuristic: anything ending in `_bucket/_sum/_count` → `histogram`, else `gauge`. This downgraded counters, mislabeled banyandb's `_count`-suffixed counters as histograms (breaking `histogram_quantile`), and split summaries into conflicting `histogram`+`gauge` lines (which Prometheus rejects).

## Fix

Thread the real metric type end-to-end:

- **Capture** the type using the Prometheus `expfmt` parser (authoritative `MetricFamily.Type`), resolving histogram/summary component series (`_bucket/_sum/_count`) to their family type by base name. The existing line-based value/label extraction is unchanged, so wire fidelity is preserved.
- **Store** it alongside descriptions in the flight recorder (`Datasource.types` + `GetTypes()`).
- **Forward** it via a new additive `Metric.type` enum (`MetricType`) over gRPC.
- **Emit** the real type from the proxy. The proxy groups by family base and emits exactly one `# TYPE` line per family.
- **Mixed-version safety:** pre-upgrade agents send `METRIC_TYPE_UNSPECIFIED`; the proxy folds those untyped samples into the matching typed family (and falls back to the legacy heuristic for fully-untyped families), so a rolling upgrade never produces two conflicting `# TYPE` lines for one metric.
- The agent's own `/metrics` exporter now maps counter/gauge correctly and degrades histogram/summary scalars to `untyped` (`NewConstMetric` can't reconstruct them).

## Tests

- Parser test: counter/gauge/histogram type propagation, including base-name resolution for component series.
- Proxy formatter tests: counter stays `counter`; a `_count`-suffixed counter is **not** folded into a histogram; histogram and summary families each emit exactly one `# TYPE` line; and two mixed-version dedup tests (typed + untyped same family → single authoritative `# TYPE` line, both samples retained).

## Verification on a live cluster

Built the images, rolled the agent sidecar (7 banyandb pods) + proxy on a live showcase cluster, and scraped the proxy `/metrics` before/after:

| `# TYPE` distribution | counter | gauge | histogram | summary |
|---|---|---|---|---|
| Before | 0 | 367 | 35 | 0 |
| After | 113 | 255 | 7 | 1 |

`rate()`/`increase()` on counters, `histogram_quantile()` on real histograms, and the GC summary quantiles all work again; no metric name has more than one `# TYPE` line.

Note: `api/.../rpc.pb.go` is generated (gitignored) and regenerated by CI from the `rpc.proto` change in this PR.
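For readers who want the base-name resolution from the Fix section in concrete form, here is a minimal standalone sketch. It is not the PR's code: `MetricType`, `resolveFamily`, and the metric names are hypothetical, and the `declared` map stands in for whatever the `expfmt` parser reports per family. The point it illustrates is that an exact name match must win over suffix stripping, so a counter whose own name ends in `_count` is not mistaken for a histogram component.

```go
package main

import (
	"fmt"
	"strings"
)

// MetricType is a hypothetical stand-in for the MetricType enum added by the PR.
type MetricType string

const (
	Unspecified MetricType = "untyped"
	Counter     MetricType = "counter"
	Gauge       MetricType = "gauge"
	Histogram   MetricType = "histogram"
	Summary     MetricType = "summary"
)

// resolveFamily maps a series name to its family name and type, given the
// types declared by "# TYPE" lines. An exact match is checked first, so a
// counter named "..._count" keeps its own type.
func resolveFamily(series string, declared map[string]MetricType) (string, MetricType) {
	if t, ok := declared[series]; ok {
		return series, t
	}
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if !strings.HasSuffix(series, suffix) {
			continue
		}
		base := strings.TrimSuffix(series, suffix)
		switch t := declared[base]; {
		case t == Histogram:
			return base, t
		case t == Summary && suffix != "_bucket": // summaries have no _bucket series
			return base, t
		}
	}
	return series, Unspecified
}

func main() {
	declared := map[string]MetricType{
		"http_request_duration_seconds": Histogram, // hypothetical metric name
		"go_gc_duration_seconds":        Summary,
		"banyandb_write_count":          Counter, // hypothetical metric name
	}
	for _, s := range []string{
		"http_request_duration_seconds_bucket",
		"go_gc_duration_seconds_sum",
		"banyandb_write_count",
		"unknown_metric",
	} {
		family, t := resolveFamily(s, declared)
		fmt.Printf("%s -> family=%s type=%s\n", s, family, t)
	}
}
```

Checking the exact name before stripping suffixes is the design choice that matters here; reversing the order would reintroduce the mislabeling described under Root cause whenever a family named like the stripped base also exists.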
