[PR] fix(handoff): prevent enqueuing parts for online nodes via shared LocateAll [skywalking-banyandb]

via GitHub Sun, 12 Apr 2026 18:31:14 -0700


hanahmily opened a new pull request, #1065:
URL: https://github.com/apache/skywalking-banyandb/pull/1065


   <!-- ==== 🐛 Remove this line WHEN AND ONLY WHEN you're fixing a bug, follow 
the checklist 👇 ====
   ### Fix liaison nodes enqueuing parts to handoff queue for online/healthy 
data nodes
   
   - [x] Add a unit test to verify that the fix works.
   - [x] Explain briefly why the bug exists and how to fix it.
   
   **Root cause:** `resolveAssignments` (handoff) and the syncer's `GetNodes` 
each computed node lists independently using `RoundRobinSelector`. Their lookup 
tables could diverge because the selector maintains a global index across all 
groups (measure, stream, trace). As a result, the handoff controller enqueued 
parts for healthy data nodes that the syncer was already delivering to.
   
   **Fix:** Extract `LocateAll` on the shared `NodeRegistry` interface so both 
`resolveAssignments` and `GetNodes` always produce identical node lists. Also 
reorder `calculateOfflineNodes` to check the syncer's online set before the 
health check, preventing enqueues for nodes the syncer already targets.
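
To illustrate the shared-lookup idea, here is a minimal sketch in which the handoff path and the syncer path both delegate to one registry. The `LocateAll` signature, the `staticRegistry` type, and the shard-offset placement rule are assumptions made for illustration; they are not taken from the BanyanDB code.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeRegistry is a hypothetical stand-in for the shared interface named in
// the PR. The real method signature may differ.
type NodeRegistry interface {
	LocateAll(group string, shardID, replicas uint32) ([]string, error)
}

// staticRegistry keeps one sorted node list, so every caller sees the same
// lookup table regardless of the order in which groups were registered.
type staticRegistry struct {
	nodes []string
}

func newStaticRegistry(nodes ...string) *staticRegistry {
	sorted := append([]string(nil), nodes...)
	sort.Strings(sorted)
	return &staticRegistry{nodes: sorted}
}

// LocateAll returns the owners of every replica of a shard. The placement
// rule (shard ID plus replica offset, modulo node count) is illustrative only.
func (r *staticRegistry) LocateAll(group string, shardID, replicas uint32) ([]string, error) {
	if len(r.nodes) == 0 {
		return nil, fmt.Errorf("no nodes registered for group %q", group)
	}
	out := make([]string, 0, replicas)
	for i := 0; i < int(replicas) && i < len(r.nodes); i++ {
		out = append(out, r.nodes[(int(shardID)+i)%len(r.nodes)])
	}
	return out, nil
}

// resolveAssignments models the handoff side of the lookup.
func resolveAssignments(reg NodeRegistry, group string, shardID, replicas uint32) ([]string, error) {
	return reg.LocateAll(group, shardID, replicas)
}

// getNodes models the syncer side. It shares the registry, so the two paths
// cannot drift apart.
func getNodes(reg NodeRegistry, group string, shardID, replicas uint32) ([]string, error) {
	return reg.LocateAll(group, shardID, replicas)
}

func main() {
	reg := newStaticRegistry("node-c", "node-a", "node-b")
	handoff, _ := resolveAssignments(reg, "measure-default", 1, 2)
	syncer, _ := getNodes(reg, "measure-default", 1, 2)
	fmt.Println(handoff, syncer)
}
```

Because both functions read the same table, any node the syncer targets is, by construction, the same node the handoff controller considers an owner.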
   
   Additionally fixes:
   - TOCTOU race in handoff queue size limit (use atomic `tryReserveSize`)
   - sidx timestamp corruption in handoff replay (populate 
MinTimestamp/MaxTimestamp)
   - File descriptor leaks in handoff replay tests (close lock files after 
writing)
   ==== 🐛 Remove this line WHEN AND ONLY WHEN you're fixing a bug, follow the 
checklist 👆 ==== -->
   
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   
   ## Summary
   
   - Extract a `LocateAll` method on the `NodeRegistry` interface so that 
`resolveAssignments` and the syncer's `GetNodes` always produce identical node 
lists, preventing the liaison from enqueuing parts for online/healthy data nodes.
   - Reorder `calculateOfflineNodes` to check the online set before the health 
check.
   - Fix a TOCTOU race that allowed the disk size limit to be bypassed, and 
populate sidx MinTimestamp/MaxTimestamp during replay.
   - Close lock files after writing in handoff replay tests to prevent FD leaks.
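
The reordering in `calculateOfflineNodes` can be sketched roughly as follows. The function signature, the `online` set representation, and the `healthy` callback are assumptions for illustration and do not reflect the actual BanyanDB types.

```go
package main

import "fmt"

// calculateOfflineNodes returns the owners that should receive handoff
// enqueues. Hypothetical signature: the real function works on BanyanDB's
// own types.
func calculateOfflineNodes(owners []string, online map[string]struct{}, healthy func(string) bool) []string {
	var offline []string
	for _, node := range owners {
		// Check the syncer's online set first: if the syncer already targets
		// this node, never enqueue for it, even if the health check is stale.
		if _, ok := online[node]; ok {
			continue
		}
		// Only nodes outside the online set fall through to the health check.
		if healthy(node) {
			continue
		}
		offline = append(offline, node)
	}
	return offline
}

func main() {
	online := map[string]struct{}{"node-a": {}}
	// A stale health check that reports every node as unhealthy.
	stale := func(string) bool { return false }
	fmt.Println(calculateOfflineNodes([]string{"node-a", "node-b"}, online, stale))
}
```

In this sketch `node-a` is skipped despite the failing health check because it is in the online set, while `node-b` is classified as offline.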
   
   ## Test plan
   
   - [x] Unit tests: `TestHandoffController_FiltersNonOwningOfflineNodes` 
verifies non-owning nodes are filtered
   - [x] Unit tests: `TestHandoffController_OnlineNodeSkipsStaleHealthCheck` 
verifies online nodes are never enqueued even with stale health checks
   - [x] Unit tests: `TestHandoffController_OfflineNodeNotInOnlineSet` verifies 
truly offline nodes are correctly classified
   - [x] Unit tests: `TestHandoffController_SizeEnforcement`, 
`TestHandoffController_ConcurrentSizeEnforcement` verify atomic size limit
   - [x] Unit tests: `TestHandoffController_SizeRecovery` verifies size 
recovery on restart
   - [x] `make lint`, `make check` pass
   - [x] All handoff-related test suites pass (`banyand/trace`, 
`banyand/liaison/grpc`)
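
For readers unfamiliar with the TOCTOU issue that the size-enforcement tests cover, the sketch below shows one way an atomic reservation can close the gap between checking the limit and updating the counter. Only the name `tryReserveSize` comes from the PR; the `sizeLimiter` type, its fields, and the compare-and-swap approach are assumptions, and the actual fix may use a different mechanism such as a mutex.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sizeLimiter is a hypothetical type that tracks bytes reserved against a cap.
type sizeLimiter struct {
	current atomic.Int64
	max     int64
}

// tryReserveSize checks the limit and reserves n bytes in a single atomic
// step. A separate "check, then add" sequence would let two goroutines both
// pass the check and jointly exceed the limit.
func (l *sizeLimiter) tryReserveSize(n int64) bool {
	for {
		cur := l.current.Load()
		if cur+n > l.max {
			return false
		}
		if l.current.CompareAndSwap(cur, cur+n) {
			return true
		}
		// Another goroutine changed the counter; retry with the fresh value.
	}
}

// release returns n bytes, for example when an enqueue fails after reserving.
func (l *sizeLimiter) release(n int64) {
	l.current.Add(-n)
}

func main() {
	l := &sizeLimiter{max: 500}
	var granted atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.tryReserveSize(10) {
				granted.Add(1)
			}
		}()
	}
	wg.Wait()
	// Exactly 50 reservations of 10 bytes fit under the 500-byte cap.
	fmt.Println(granted.Load(), l.current.Load()) // prints "50 500"
}
```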
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
