hanahmily opened a new pull request, #1153:
URL: https://github.com/apache/skywalking-banyandb/pull/1153

   ## Problem
   
   `fodc-proxy` silently corrupts Prometheus metric **types**. On a live 
cluster the proxy `/metrics` exposed **0 counters and 0 summaries**: counters 
were emitted as `gauge`, or as `histogram` when their name ended in `_count`, 
and `go_gc_duration_seconds` got two conflicting `# TYPE` lines.
   
   ## Root cause
   
   Type information was destroyed before the proxy could ever emit it, across 
the whole pipeline `banyandb /metrics → agent → gRPC StreamMetrics → proxy 
/metrics`:
   
   1. **Agent parser** captured `# HELP` but dropped `# TYPE` entirely 
(`RawMetric` had no type field).
   2. **Flight recorder** stored descriptions but no types.
   3. **`StreamMetrics` proto** `Metric` had no type field.
   4. **Proxy aggregator** `AggregatedMetric` had no type field.
   5. **Proxy formatter** guessed the type from a name-suffix heuristic: 
any name ending in `_bucket`, `_sum`, or `_count` became `histogram`, and 
everything else `gauge`. This downgraded counters to gauges, mislabeled 
banyandb's `_count`-suffixed counters as histograms (breaking 
`histogram_quantile`), and split summaries into conflicting `histogram` and 
`gauge` lines (which Prometheus rejects).
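
The misclassification in step 5 can be reproduced with a minimal Go sketch of such a suffix heuristic. `legacyGuessType` and every metric name other than `go_gc_duration_seconds` are hypothetical and are not taken from the proxy source.

```go
package main

import "strings"

// legacyGuessType is a hypothetical stand-in for the name-suffix heuristic
// described above: a name ending in _bucket, _sum, or _count is reported as
// a histogram, and every other name as a gauge. The declared type of the
// metric is never consulted.
func legacyGuessType(name string) string {
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if strings.HasSuffix(name, suffix) {
			return "histogram"
		}
	}
	return "gauge"
}
```

Under this sketch a counter such as `http_requests_total` comes back as `gauge`, a counter named `banyandb_write_count` comes back as `histogram`, and the summary `go_gc_duration_seconds` is split between `gauge` (quantile series) and `histogram` (`_sum` and `_count` series).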
   
   ## Fix
   
   Thread the real metric type end-to-end:
   
   - **Capture** the type using the Prometheus `expfmt` parser (authoritative 
`MetricFamily.Type`), resolving histogram/summary component series 
(`_bucket/_sum/_count`) to their family type by base name. The existing 
line-based value/label extraction is unchanged, so wire fidelity is preserved.
   - **Store** it alongside descriptions in the flight recorder 
(`Datasource.types` + `GetTypes()`).
   - **Forward** it via a new additive `Metric.type` enum (`MetricType`) over 
gRPC.
   - **Emit** the real type from the proxy. The proxy groups by family base and 
emits exactly one `# TYPE` line per family.
   - **Mixed-version safety:** pre-upgrade agents send 
`METRIC_TYPE_UNSPECIFIED`; the proxy folds those untyped samples into the 
matching typed family (and falls back to the legacy heuristic for fully-untyped 
families), so a rolling upgrade never produces two conflicting `# TYPE` lines 
for one metric.
   - The agent's own `/metrics` exporter now maps counter/gauge correctly and 
degrades histogram/summary scalars to `untyped` (`NewConstMetric` can't 
reconstruct them).
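
The capture step can be sketched in Go as follows. The PR relies on the Prometheus `expfmt` parser; this sketch swaps in a plain scan of `# TYPE` lines using only the standard library, so it illustrates base-name resolution rather than the actual implementation. `parseTypes`, `resolveType`, and the metric names are hypothetical.

```go
package main

import (
	"bufio"
	"strings"
)

// parseTypes collects family name -> declared type from
// "# TYPE <name> <type>" lines in a text exposition payload.
func parseTypes(exposition string) map[string]string {
	types := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" {
			types[fields[2]] = fields[3]
		}
	}
	return types
}

// resolveType returns the type for a series name. An exact family match
// wins, so a counter that happens to end in _count stays a counter.
// Otherwise a component suffix is stripped and the base name is looked up,
// accepting only histogram (or summary, for _sum and _count) families.
func resolveType(types map[string]string, series string) string {
	if t, ok := types[series]; ok {
		return t
	}
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if !strings.HasSuffix(series, suffix) {
			continue
		}
		t := types[strings.TrimSuffix(series, suffix)]
		if t == "histogram" || (t == "summary" && suffix != "_bucket") {
			return t
		}
	}
	return "untyped"
}
```

Checking the exact name before stripping a suffix is the design choice that keeps `_count`-suffixed counters out of histogram families.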
   
   ## Tests
   
   - Parser test: counter/gauge/histogram type propagation incl. base-name 
resolution for component series.
   - Proxy formatter tests: counter stays `counter`; a `_count`-suffixed 
counter is **not** folded into a histogram; histogram and summary families each 
emit exactly one `# TYPE` line; and two mixed-version dedup tests (typed + 
untyped same family → single authoritative `# TYPE` line, both samples 
retained).
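
The mixed-version behaviour these tests exercise can be sketched in Go as below. This is an illustration under stated assumptions, not the proxy's code: `Sample`, `render`, and the helper names are hypothetical, an empty `Type` stands in for `METRIC_TYPE_UNSPECIFIED`, and sample lines are assumed to be preformatted.

```go
package main

import "strings"

// Sample is a hypothetical, simplified view of one forwarded series.
type Sample struct {
	Name string // series name, e.g. "req_duration_bucket"
	Type string // "" stands in for METRIC_TYPE_UNSPECIFIED
	Line string // preformatted exposition line for the sample
}

var componentSuffixes = []string{"_bucket", "_sum", "_count"}

// trimComponent strips a histogram/summary component suffix, if present.
func trimComponent(name string) (string, bool) {
	for _, s := range componentSuffixes {
		if strings.HasSuffix(name, s) {
			return strings.TrimSuffix(name, s), true
		}
	}
	return name, false
}

// familyOf returns the family name for a sample whose type is known.
func familyOf(name, typ string) string {
	if typ == "histogram" || typ == "summary" {
		base, _ := trimComponent(name)
		return base
	}
	return name
}

// render groups samples by family and writes one "# TYPE" line per family.
// Untyped samples are folded into a matching typed family; families with no
// typed sample at all fall back to the suffix heuristic.
func render(samples []Sample) string {
	typed := make(map[string]string) // family -> type from typed samples
	for _, s := range samples {
		if s.Type != "" {
			typed[familyOf(s.Name, s.Type)] = s.Type
		}
	}
	var order []string
	lines := make(map[string][]string)
	types := make(map[string]string)
	for _, s := range samples {
		family, typ := s.Name, s.Type
		switch {
		case typ != "":
			family = familyOf(s.Name, typ)
		case typed[s.Name] != "":
			typ = typed[s.Name] // fold into the typed family of the same name
		default:
			base, isComponent := trimComponent(s.Name)
			if t := typed[base]; isComponent && (t == "histogram" || t == "summary") {
				family, typ = base, t
			} else if isComponent {
				family, typ = base, "histogram" // legacy heuristic
			} else {
				typ = "gauge" // legacy heuristic
			}
		}
		if _, seen := types[family]; !seen {
			order = append(order, family)
			types[family] = typ
		}
		lines[family] = append(lines[family], s.Line)
	}
	var b strings.Builder
	for _, family := range order {
		b.WriteString("# TYPE " + family + " " + types[family] + "\n")
		for _, l := range lines[family] {
			b.WriteString(l + "\n")
		}
	}
	return b.String()
}
```

Collecting typed families in a first pass makes the result independent of arrival order, so an untyped sample from a pre-upgrade agent is folded correctly even when it is seen before the typed one.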
   
   ## Verification on a live cluster
   
   Built the images, rolled out the agent sidecar (7 banyandb pods) and the 
proxy on a live showcase cluster, and scraped the proxy `/metrics` before and after:
   
   | `# TYPE` distribution | counter | gauge | histogram | summary |
   |---|---|---|---|---|
   | Before | 0 | 367 | 35 | 0 |
   | After  | 113 | 255 | 7 | 1 |
   
   `rate()`/`increase()` on counters, `histogram_quantile()` on real 
histograms, and the GC summary quantiles all work again; no metric name has 
more than one `# TYPE` line.
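
A check of this kind can be sketched in Go: tally the `# TYPE` lines of a scrape by type and list any metric name declared more than once. `typeStats` is a hypothetical helper and not part of the PR; the scrape body is assumed to be available as a string.

```go
package main

import (
	"bufio"
	"strings"
)

// typeStats tallies "# TYPE <name> <type>" lines by type and returns the
// metric names that are declared by more than one "# TYPE" line.
func typeStats(scrape string) (counts map[string]int, duplicates []string) {
	counts = make(map[string]int)
	seen := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) != 4 || f[0] != "#" || f[1] != "TYPE" {
			continue // not a TYPE line
		}
		counts[f[3]]++
		seen[f[2]]++
		if seen[f[2]] == 2 {
			// Report each duplicated name once, on its second declaration.
			duplicates = append(duplicates, f[2])
		}
	}
	return counts, duplicates
}
```

Run against a scrape, `counts` gives a per-type distribution like the table above, and an empty `duplicates` slice corresponds to the "no metric name has more than one `# TYPE` line" condition.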
   
   Note: `api/.../rpc.pb.go` is generated (gitignored) and regenerated by CI 
from the `rpc.proto` change in this PR.
   

