hanahmily opened a new pull request, #1160:
URL: https://github.com/apache/skywalking-banyandb/pull/1160
### Fix FODC proxy label collision and add a reliable storage-tier label
While investigating why `banyandb_queue_sub_total_msg_received` (and other
metrics) showed an inconsistent `type="hot"` label across data nodes, I found
two issues in the FODC proxy's metric aggregation.
#### 1. Node labels clobber metric-intrinsic labels
`Aggregator.ProcessMetricsFromAgent` overlaid each agent's node labels onto
every metric and **overwrote same-named labels**:
```go
for key, value := range metric.Labels {
	labels[key] = value
}
for key, value := range agentInfo.Labels {
	labels[key] = value // overwrites on collision
}
```
A data node reports a `type=hot/warm/cold` node label (its storage tier,
from `--node-labels`). On any node whose agent had resolved its labels, this
`type` value **overwrote the merge metrics' own intrinsic `type="file"/"mem"`
dimension** — e.g. `banyandb_measure_total_merged_parts{type="file"}` became
`{type="hot"}`. Confirmed live: the one node that had resolved its labels
reported `type="hot"` on its merge metrics while every other node correctly
reported `type="file"/"mem"`.
**Fix:** apply a node label only when the metric does not already define
that key, so metric-intrinsic labels win.
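
A minimal sketch of that merge rule, not the code from this PR: `mergeLabels` is a hypothetical helper name and the `zone` label is invented for the example. It copies the metric's own labels first and then adds a node label only if the key is still unset.

```go
package main

import "fmt"

// mergeLabels is a hypothetical helper illustrating the rule described above:
// metric-intrinsic labels are copied first, and a node label is added only
// when the metric does not already define that key.
func mergeLabels(metricLabels, nodeLabels map[string]string) map[string]string {
	labels := make(map[string]string, len(metricLabels)+len(nodeLabels))
	for key, value := range metricLabels {
		labels[key] = value
	}
	for key, value := range nodeLabels {
		if _, exists := labels[key]; !exists {
			labels[key] = value // no collision, safe to add the node label
		}
	}
	return labels
}

func main() {
	merged := mergeLabels(
		map[string]string{"type": "file"},             // metric-intrinsic dimension
		map[string]string{"type": "hot", "zone": "a"}, // node labels ("zone" is made up)
	)
	fmt.Println(merged["type"], merged["zone"]) // prints "file a"
}
```

With this ordering the node's `type="hot"` can no longer replace the merge metrics' `type="file"`, while non-colliding node labels still reach every series.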
#### 2. No reliable way to attribute a metric to a storage tier
The tier was exposed only through that colliding, unreliable node `type`
label, and only on agents that won the startup node-info resolution race, so
most series carried no tier at all.
**Fix:** the proxy now derives a stable, collision-free `tier` label
(`hot`/`warm`/`cold`) from the data-node pod name (e.g. `…-data-warm-1` →
`tier="warm"`), independent of whether the agent resolved its node labels.
Liaison/standalone/non-standard pod names simply get no `tier`.
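
One way such a derivation could look, as a sketch under assumptions rather than the PR's implementation: the function name `tierFromPodName` comes from the tests mentioned below, but its signature, the regular expression, and the `demo-…` pod names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataTierPattern encodes an assumed naming convention:
// "<prefix>-data-<tier>-<ordinal>", with tier one of hot, warm, or cold.
var dataTierPattern = regexp.MustCompile(`-data-(hot|warm|cold)-\d+$`)

// tierFromPodName returns the storage tier embedded in a data-node pod name,
// or "" when the name does not match the assumed pattern.
func tierFromPodName(podName string) string {
	match := dataTierPattern.FindStringSubmatch(podName)
	if match == nil {
		return "" // liaison, standalone, or non-standard name: no tier label
	}
	return match[1]
}

func main() {
	// Hypothetical pod names, used only to exercise the sketch.
	for _, name := range []string{"demo-data-warm-1", "demo-liaison-0", "demo-data-0"} {
		fmt.Printf("%q -> %q\n", name, tierFromPodName(name))
	}
}
```

Returning an empty string for unmatched names lets the caller skip the `tier` label entirely instead of emitting a placeholder value.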
#### Tests
Adds unit tests for `tierFromPodName` and for both behaviors in
`ProcessMetricsFromAgent` (metric label wins over node label on collision; tier
derived from pod name). `go build`, `go vet`, and the package tests pass;
`gofmt` clean.
#### Known follow-up (agent side, not in this PR)
The root reason the tier/role were unreliable is that the **agent resolves
its node role and labels once at startup and never refreshes** (`cmd/root.go` →
`proxy.NewClient`/watchdog snapshot; registration in `proxy/client.go` sends
them once). On the live cluster only 1 of 7 agents had resolved — the other six
were frozen at `node_role=ROLE_UNSPECIFIED` with no node labels. #1157 added a
resolution retry budget but did not make the values refresh after registration.
The durable fix is to read node role/labels live from the collector and
re-register when they resolve. This PR makes the proxy robust to that condition
(and stops the corruption); the agent-side refresh should follow. (Issues are
disabled on this mirror; tracking here.)
- [x] Update the [`CHANGES` log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).