hanahmily opened a new pull request, #1160:
URL: https://github.com/apache/skywalking-banyandb/pull/1160

   ### Fix FODC proxy label collision and add a reliable storage-tier label
   
   While investigating why `banyandb_queue_sub_total_msg_received` (and other 
metrics) showed an inconsistent `type="hot"` label across data nodes, I found 
two issues in the FODC proxy's metric aggregation.
   
   #### 1. Node labels clobber metric-intrinsic labels
   
   `Aggregator.ProcessMetricsFromAgent` overlaid each agent's node labels onto 
every metric and **overwrote same-named labels**:
   
   ```go
   for key, value := range metric.Labels { labels[key] = value }
   for key, value := range agentInfo.Labels { labels[key] = value } // overwrites on collision
   ```
   
   A data node reports a `type=hot/warm/cold` node label (its storage tier, 
from `--node-labels`). On any node whose agent had resolved its labels, this 
`type` value **overwrote the merge metrics' own intrinsic `type="file"/"mem"` 
dimension** — e.g. `banyandb_measure_total_merged_parts{type="file"}` became 
`{type="hot"}`. Confirmed live: the one node that had resolved its labels 
reported `type="hot"` on its merge metrics while every other node correctly 
reported `type="file"/"mem"`.
   
   **Fix:** apply a node label only when the metric does not already define 
that key, so metric-intrinsic labels win.
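   
   A minimal sketch of the non-overwriting merge described above. `mergeLabels` and the `zone` label are hypothetical names used only for illustration; the PR applies the rule inside `Aggregator.ProcessMetricsFromAgent`, whose exact code is not reproduced here.
   
   ```go
   package main

   import "fmt"

   // mergeLabels is a hypothetical helper (not the PR's actual function).
   // It copies the metric's own labels first, then adds a node label only
   // when the metric does not already define that key.
   func mergeLabels(metricLabels, nodeLabels map[string]string) map[string]string {
   	labels := make(map[string]string, len(metricLabels)+len(nodeLabels))
   	for key, value := range metricLabels {
   		labels[key] = value
   	}
   	for key, value := range nodeLabels {
   		if _, exists := labels[key]; !exists {
   			labels[key] = value // no collision: safe to add the node label
   		}
   	}
   	return labels
   }

   func main() {
   	merged := mergeLabels(
   		map[string]string{"type": "file"},             // metric-intrinsic dimension
   		map[string]string{"type": "hot", "zone": "a"}, // node labels; "zone" is illustrative
   	)
   	fmt.Println(merged["type"], merged["zone"]) // prints "file a"
   }
   ```
   
   With this rule the merge metric keeps `type="file"`, and the colliding node `type="hot"` is dropped for that series rather than replacing it.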
   
   #### 2. No reliable way to attribute a metric to a storage tier
   
   The tier was exposed only through that colliding, unreliable node `type` 
label, and only on agents that won the startup node-info resolution race, so 
most series carried no tier at all.
   
   **Fix:** the proxy now derives a stable, collision-free `tier` label 
(`hot`/`warm`/`cold`) from the data-node pod name (e.g. `…-data-warm-1` → 
`tier="warm"`), independent of whether the agent resolved its node labels. 
Liaison/standalone/non-standard pod names simply get no `tier`.
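   
   One way the derivation could look, as a sketch only. The function name `tierFromPodName` comes from the tests mentioned in this PR, but its signature, the regular expression, and the `demo-…` pod names below are assumptions; the real matching rule may differ.
   
   ```go
   package main

   import (
   	"fmt"
   	"regexp"
   )

   // Assumed naming pattern: "<prefix>-data-<tier>-<ordinal>".
   // The pattern used by the actual proxy code is not shown in this PR text.
   var dataPodTier = regexp.MustCompile(`-data-(hot|warm|cold)-\d+$`)

   // tierFromPodName returns "hot", "warm", or "cold" for a data-node pod
   // name that matches the assumed pattern, and "" for anything else
   // (liaison, standalone, or non-standard names), so no tier label is set.
   func tierFromPodName(podName string) string {
   	match := dataPodTier.FindStringSubmatch(podName)
   	if match == nil {
   		return ""
   	}
   	return match[1]
   }

   func main() {
   	// Hypothetical pod names for illustration.
   	for _, name := range []string{"demo-data-warm-1", "demo-liaison-0", "standalone"} {
   		fmt.Printf("%q -> tier=%q\n", name, tierFromPodName(name))
   	}
   }
   ```
   
   Because the tier is computed from the pod name alone, it does not depend on whether the agent ever resolved its node labels.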
   
   #### Tests
   
   Adds unit tests for `tierFromPodName` and for both behaviors in 
`ProcessMetricsFromAgent` (metric label wins over node label on collision; tier 
derived from pod name). `go build`, `go vet`, and the package tests pass; 
`gofmt` clean.
   
   #### Known follow-up (agent side, not in this PR)
   
   The root reason the tier/role were unreliable is that the **agent resolves 
its node role and labels once at startup and never refreshes** (`cmd/root.go` → 
`proxy.NewClient`/watchdog snapshot; registration in `proxy/client.go` sends 
them once). On the live cluster only 1 of 7 agents had resolved them; the other 
six were frozen at `node_role=ROLE_UNSPECIFIED` with no node labels. #1157 added a 
resolution retry budget but did not make the values refresh after registration. 
The durable fix is to read node role/labels live from the collector and 
re-register when they resolve. This PR makes the proxy robust to that condition 
(and stops the corruption); the agent-side refresh should follow. (Issues are 
disabled on this mirror; tracking here.)
   
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   

