hanahmily opened a new pull request, #1157:
URL: https://github.com/apache/skywalking-banyandb/pull/1157

   ## Problem
   
   The FODC agent labels every scraped metric with `node_role`, but in a live 
cluster 6 of 7 nodes report `node_role="ROLE_UNSPECIFIED"` instead of their 
real `ROLE_DATA`/`ROLE_LIAISON`. Only the one node whose sibling container 
happened to be ready at the right instant reports correctly.
   
   ## Root cause
   
   In `fodc/agent/internal/cluster/collector.go` the node role is resolved 
**exactly once at startup**:
   
   - `collectLoop` calls `pollCurrentNode(ctx)` a single time → 
`fetchCurrentNodes` → `fetchCurrentNodeFromEndpoint`, which retries only 
`maxRetries=3` times over roughly 300ms–1s.
   - At agent startup the sibling lifecycle/banyandb gRPC server is usually not 
listening yet, so every attempt fails fast with `dial tcp [::1]:PORT: connect: 
cannot assign requested address`. `currentNodes` stays empty, 
`NodeRoleFromNode(nil)` returns `ROLE_UNSPECIFIED`, and that value is captured 
once (in `cmd/root.go`) and never refreshed — so the wrong role is permanent 
for the agent's lifetime.
   
   Agent-log evidence (all three started cleanly, `restartCount=0`):
   
   | pod | startup `GetCurrentNode` | resolved `node_role` |
   |---|---|---|
   | `data-hot-1` | succeeded on first poll | `ROLE_DATA` ✓ |
   | `data-hot-0` | all endpoints failed (`cannot assign requested address`) | `ROLE_UNSPECIFIED` ✗ |
   | `liaison-0` | endpoint failed | `ROLE_UNSPECIFIED` ✗ |
   
   The lifecycle endpoint is reachable seconds later (periodic cluster-state 
polls succeed), confirming this is purely a startup readiness race, not a 
permanent connectivity problem.
   
   ## Fix
   
   Replace the single-shot startup poll with `pollCurrentNodeWithRetry`, which 
keeps polling with exponential backoff (`initialBackoff`→`maxBackoff`) until a 
node with a non-`UNSPECIFIED` role is obtained, or a 25s budget 
(`nodeRoleResolveBudget`, kept below the existing 30s `WaitForNodeFetched` 
deadline) elapses. It reuses the existing `pollCurrentNode`, respects `ctx` 
cancellation and the collector's `closer`, and the budget is a struct field so 
it stays testable without long waits.
   
   ## Tests
   
   - `TestCollector_PollCurrentNodeWithRetry_ResolvesAfterTransientFailure` — 
fails the first `maxRetries` `GetCurrentNode` calls (forcing a second outer 
poll), then asserts the role resolves to `ROLE_DATA`.
   - `TestCollector_PollCurrentNodeWithRetry_GivesUpAfterBudget` — 
always-failing client with a 200ms budget; asserts it returns promptly with 
`ROLE_UNSPECIFIED` (hang guard).
   
   `go build ./fodc/agent/...`, `gofmt -l`, `go test -run TestCollector 
./fodc/agent/internal/cluster/...`, and `make -C fodc/agent lint` all pass.
   

