This is an automated email from the ASF dual-hosted git repository.
hanahmily pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
The following commit(s) were added to refs/heads/main by this push:
new af5970f0c fix(queue/pub): keep pub-node-probe alive after caller ctx cancel (#1130)
af5970f0c is described below
commit af5970f0c2ceba1c320bba60a2d17072892cdafe
Author: Gao Hongtao <[email protected]>
AuthorDate: Fri May 15 09:23:51 2026 +0800
fix(queue/pub): keep pub-node-probe alive after caller ctx cancel (#1130)
The writable probe goroutine was inheriting the originating publish's
request context via run.Go(ctx, ...). When the request returned, that
ctx was canceled and every checkServiceHealth call in the probe loop
returned code=Canceled before reaching the peer, so h.OnAddOrUpdate was
never invoked and the node stayed evicted from the selector until the
liaison restarted. Anchor the probe to context.Background() instead, in
line with pub-failover and the other detached, service-lifetime
goroutines in this package. Shutdown is unchanged: the loop's select
already exits on p.closer.CloseNotify().
A //nolint:contextcheck directive is added since the surrounding
checkWritable has a parent ctx in scope; this matches the existing
suppression pattern used at pub.go:436.
Regression introduced in #1084 (panic diagnostics refactor that swapped
the bare goroutine for run.Go and started forwarding the caller's ctx).
Co-authored-by: Claude Opus 4.7 <[email protected]>
---
banyand/queue/pub/client.go | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/banyand/queue/pub/client.go b/banyand/queue/pub/client.go
index 433f24356..0fe3b9731 100644
--- a/banyand/queue/pub/client.go
+++ b/banyand/queue/pub/client.go
@@ -215,7 +215,13 @@ func (p *pub) checkWritable(ctx context.Context, n string, topic bus.Topic) (boo
p.writableProbeMu.Unlock()
probeName, probeTopic := n, topicStr
- run.Go(ctx, "pub-node-probe", p.log, func(probeCtx context.Context) {
+ // The probe outlives the caller's request, so it must not inherit the
+ // caller's ctx: when the originating publish returns, that ctx is
+ // canceled and every checkServiceHealth call below would immediately
+ // fail with code=Canceled, latching the node out of the selector
+ // forever. Shutdown is handled by p.closer.CloseNotify() in the
+ // select below.
+ run.Go(context.Background(), "pub-node-probe", p.log, func(probeCtx context.Context) { //nolint:contextcheck // probe is service-lifetime, must not inherit caller ctx
defer p.closer.Done()
defer func() {
p.writableProbeMu.Lock()