hanahmily opened a new pull request, #1157: URL: https://github.com/apache/skywalking-banyandb/pull/1157
## Problem

The FODC agent labels every scraped metric with `node_role`, but in a live cluster 6 of 7 nodes report `node_role="ROLE_UNSPECIFIED"` instead of their real `ROLE_DATA`/`ROLE_LIAISON`. Only the one node whose sibling container happened to be ready at the right instant reports correctly.

## Root cause

In `fodc/agent/internal/cluster/collector.go` the node role is resolved **exactly once at startup**:

- `collectLoop` calls `pollCurrentNode(ctx)` a single time → `fetchCurrentNodes` → `fetchCurrentNodeFromEndpoint`, which only retries `maxRetries=3` over ~300ms–1s.
- At agent startup the sibling lifecycle/banyandb gRPC server is usually not listening yet, so every attempt fails fast with `dial tcp [::1]:PORT: connect: cannot assign requested address`.

`currentNodes` stays empty, `NodeRoleFromNode(nil)` returns `ROLE_UNSPECIFIED`, and that value is captured once (in `cmd/root.go`) and never refreshed, so the wrong role is permanent for the agent's lifetime.

Agent-log evidence (all three started cleanly, `restartCount=0`):

| pod | startup `GetCurrentNode` | resolved `node_role` |
|---|---|---|
| `data-hot-1` | succeeded on first poll | `ROLE_DATA` ✓ |
| `data-hot-0` | all endpoints failed (`cannot assign requested address`) | `ROLE_UNSPECIFIED` ✗ |
| `liaison-0` | endpoint failed | `ROLE_UNSPECIFIED` ✗ |

The lifecycle endpoint is reachable seconds later (periodic cluster-state polls succeed), confirming this is purely a startup readiness race, not a permanent connectivity problem.

## Fix

Replace the single-shot startup poll with `pollCurrentNodeWithRetry`, which keeps polling with exponential backoff (`initialBackoff`→`maxBackoff`) until a node with a non-`UNSPECIFIED` role is obtained, or a 25s budget (`nodeRoleResolveBudget`, kept below the existing 30s `WaitForNodeFetched` deadline) elapses.
It reuses the existing `pollCurrentNode`, respects `ctx` cancellation and the collector's `closer`, and the budget is a struct field so it stays testable without long waits.

## Tests

- `TestCollector_PollCurrentNodeWithRetry_ResolvesAfterTransientFailure`: fails the first `maxRetries` `GetCurrentNode` calls (forcing a second outer poll), then asserts the role resolves to `ROLE_DATA`.
- `TestCollector_PollCurrentNodeWithRetry_GivesUpAfterBudget`: always-failing client with a 200ms budget; asserts it returns promptly with `ROLE_UNSPECIFIED` (hang guard).

`go build ./fodc/agent/...`, `gofmt -l`, `go test -run TestCollector ./fodc/agent/internal/cluster/...`, and `make -C fodc/agent lint` all pass.
