[PR] fix(property): prevent gossip propagation from getting stuck after replica scale-up [skywalking-banyandb]

via GitHub Tue, 12 May 2026 07:05:05 -0700


mrproliu opened a new pull request, #1127:
URL: https://github.com/apache/skywalking-banyandb/pull/1127


   ### Fix apache/skywalking#13852
   
   - [x] Add a unit test to verify that the fix works.
   - [x] Explain briefly why the bug exists and how to fix it.
   
   #### Why the bug exists
   
   After replica scale-up, the very first round of property repair could 
deadlock the gossip scheduler. Three latent issues compound:
   
   1. **`size=1` channel with drop-on-full in `groupWithShardPropagation`** — 
when a handler was slow (e.g. a long-running per-property `repair`), subsequent 
same-round / TTL-takeover requests hit the `default` branch of the `select` and 
were silently dropped. Once dropped, the round was never retried — the 
scheduler appeared healthy but no progress was made.
   2. **`groupNotify` cap=10 with drop-on-full** — under bursty load, signals 
could be coalesced away while pending entries existed, leaving them stranded 
because the worker consumed exactly one pending per wake-up.
   3. **No per-property timeout on `syncShard.repair`** — a single slow 
property lookup could consume the entire `perNodeSyncTimeout` (1h) budget, 
stalling the whole repair stream.
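
The drop-on-full behavior in item 1 can be illustrated with a minimal sketch. It is not the project's code: `request` and `trySend` are hypothetical names standing in for `handlingRequest` and the enqueue path, and the slow handler is modeled simply by never reading from the channel before the second send.

```go
package main

import "fmt"

// request is a hypothetical stand-in for the PR's handlingRequest.
type request struct{ round int }

// trySend mimics a non-blocking enqueue: when the buffer is full the
// default branch runs and the request is dropped without any retry.
func trySend(ch chan *request, r *request) bool {
	select {
	case ch <- r:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan *request, 1) // size=1, as described above

	fmt.Println(trySend(ch, &request{round: 1})) // true: buffered
	fmt.Println(trySend(ch, &request{round: 2})) // false: buffer full, dropped
	fmt.Println((<-ch).round)                    // 1: only the older request survives
}
```

With a busy handler, the newer request is the one that gets lost, which matches the symptom of a scheduler that looks healthy but makes no progress.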
   
   #### How it's fixed
   
   - **`banyand/property/gossip/server.go`**: replace the per-shard `channel 
chan *handlingRequest` with a `pending *handlingRequest` pointer using 
**latest-wins coalesce** semantics. Overwriting a non-nil `pending` increments 
a new `totalCoalesced` metric and logs at Info (instead of dropping + logging 
Error). The `processPropagation` worker now **drains all ready pending entries 
per wake-up**, with an inner `CloseNotify` re-check so `GracefulStop` stays 
responsive even with backlog.
   - **`banyand/property/db/repair_gossip.go`**: wrap each 
`syncShard.repair(...)` call (client `processDifferTreeSummary` and server 
`processPropertySync` paths) with `context.WithTimeout(ctx, 
repairPerPropertyTimeout)` (10s). A single slow property no longer monopolizes 
the per-node sync budget; the existing per-entry error handling cleanly skips a 
timed-out property and continues the loop/stream.
   - New metric: `total_coalesced{group}`.
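
The latest-wins coalesce and drain-all ideas from the `server.go` change can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `shardSlot`, `offer`, `take`, and `drain` are hypothetical names, the `coalesced` counter stands in for the `totalCoalesced` metric, and the `stop` channel stands in for `CloseNotify`.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for the PR's handlingRequest.
type request struct{ round int }

// shardSlot holds at most one pending request per shard (hypothetical type).
type shardSlot struct {
	mu        sync.Mutex
	pending   *request
	coalesced int // stands in for the totalCoalesced metric
}

// offer stores r, overwriting any older pending request (latest wins).
func (s *shardSlot) offer(r *request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending != nil {
		s.coalesced++ // an older request is replaced rather than dropping r
	}
	s.pending = r
}

// take removes and returns the pending request, or nil if there is none.
func (s *shardSlot) take() *request {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.pending
	s.pending = nil
	return r
}

// drain handles every ready slot in a single wake-up and re-checks stop
// before each entry so shutdown is not delayed by a backlog.
func drain(slots []*shardSlot, stop <-chan struct{}, handle func(*request)) {
	for _, s := range slots {
		select {
		case <-stop:
			return
		default:
		}
		if r := s.take(); r != nil {
			handle(r)
		}
	}
}

func main() {
	a, b := &shardSlot{}, &shardSlot{}
	a.offer(&request{round: 1})
	a.offer(&request{round: 2}) // overwrites round 1
	b.offer(&request{round: 3})

	var handled []int
	drain([]*shardSlot{a, b}, make(chan struct{}), func(r *request) {
		handled = append(handled, r.round)
	})
	fmt.Println(handled, a.coalesced) // [2 3] 1
}
```

The design trade-off in this sketch is that an overwritten request is never handled, but the newest one always is, so a wake-up signal lost under load cannot strand the latest round.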
   
   #### Tests
   
   - `banyand/property/gossip/service_test.go`: 5 new ginkgo cases covering 
latest-wins reach-Rev, TTL-expired takeover (empty + non-nil pending), 
drain-all per wake-up under saturated `groupNotify`, and `CloseNotify` honored 
during drain.
   - `banyand/property/db/repair_gossip_test.go`: 1 new case verifying a single 
property repair timeout does not abort the round.
   - All existing tests pass under `-race`.
   
   ---
   
   - [x] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Fixes apache/skywalking#13852.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).


