hanahmily opened a new pull request, #1130:
URL: https://github.com/apache/skywalking-banyandb/pull/1130
## Summary
- The writable probe goroutine in `pub.checkWritable` was launched via
`run.Go(ctx, ...)` with the caller's request context. When the originating
publish RPC returned, that ctx was canceled and every subsequent
`checkServiceHealth` in the probe loop returned `code=Canceled` before reaching
the peer, so `h.OnAddOrUpdate(nodeCur.md)` was never invoked and the node
stayed evicted from the selector until the liaison restarted.
- Anchor the probe to `context.Background()` instead, consistent with
`pub-failover` and the other detached, service-lifetime goroutines in this
package. Shutdown remains driven by `p.closer.CloseNotify()` inside the select
loop.
- This is a regression introduced in #1084, the panic diagnostics refactor that
swapped the bare goroutine for `run.Go` and started forwarding the caller's ctx.
## Symptom observed
In a 2-liaison deployment, liaison-1 (which receives all OAP writes) had
`measureLiaisonNodeSel.nodes = [liaison-1]` — its peer was missing from the
selector even though `connMgr` still reported it as active. liaison-0's
selector was correctly `[liaison-0, liaison-1]`. liaison-1's logs showed the
latched probe state:
```
{"level":"warn","module":"SERVER-QUEUE-PUB-LIAISON","topic":"v1:measure-write",
"error":"... rpc error: code = Canceled desc = context canceled",
"node":"...liaison-0...:18912","message":"data node can not ingest data"}
```
The warning recurred every 17-20s across the `v1:measure-write`,
`v1:stream-write`, and `v1:trace-write` topics: one probe per topic, none ever
succeeding, because the probe's gRPC call was canceled before it could leave
the client.
## Why the bug is asymmetric
`checkWritable` only fires on publish failures, and OAP only writes to
liaison-1, so only liaison-1 ever triggers the probe path. A single transient
hiccup on liaison-1 → liaison-0 (a slow `clusterv1.Service` health check ≥ 2s)
is enough to remove liaison-0 from liaison-1's selector and latch it out via
the probe-context bug. liaison-0 never publishes to peers, so its selector
keeps the full view it built at startup.
## Test plan
- [x] `make license-check / check-req / build / lint / check`
- [x] `make test-ci PKG=./banyand/...` — 26 suites passed
- [x] `make test-ci PKG=./bydbctl/...` — 1 suite passed
- [x] `make test-ci PKG=./pkg/...` — 41 suites passed
- [x] `make test-ci PKG=./fodc/...` — 16 suites passed
- [x] `make test-ci PKG=./test/integration/standalone/...` — passed
- [x] `make test-ci PKG=./test/integration/distributed/...` — 11 suites
passed
🤖 Generated with [Claude Code](https://claude.com/claude-code)