mrproliu opened a new pull request, #1127:
URL: https://github.com/apache/skywalking-banyandb/pull/1127
### Fix apache/skywalking#13852
- [x] Add a unit test to verify that the fix works.
- [x] Explain briefly why the bug exists and how to fix it.
#### Why the bug exists
After replica scale-up, the very first round of property repair could
deadlock the gossip scheduler. Two latent issues compound:
1. **`size=1` channel with drop-on-full in `groupWithShardPropagation`** —
when a handler was slow (e.g. a long-running per-property `repair`), subsequent
same-round / TTL-takeover requests hit the `default` branch of the `select` and
were silently dropped. Once dropped, the round was never retried — the
scheduler appeared healthy but no progress was made.
2. **`groupNotify` cap=10 with drop-on-full** — under bursty load, signals
could be coalesced away while pending entries existed, leaving them stranded
because the worker consumed exactly one pending per wake-up.
3. **No per-property timeout on `syncShard.repair`** — a single slow
property lookup could consume the entire `perNodeSyncTimeout` (1h) budget,
stalling the whole repair stream.
#### How it's fixed
- **`banyand/property/gossip/server.go`**: replace the per-shard `channel
chan *handlingRequest` with a `pending *handlingRequest` pointer using
**latest-wins coalesce** semantics. Overwriting a non-nil `pending` increments
a new `totalCoalesced` metric and logs at Info (instead of dropping + logging
Error). The `processPropagation` worker now **drains all ready pending entries
per wake-up**, with an inner `CloseNotify` re-check so `GracefulStop` stays
responsive even with backlog.
- **`banyand/property/db/repair_gossip.go`**: wrap each
`syncShard.repair(...)` call (client `processDifferTreeSummary` and server
`processPropertySync` paths) with `context.WithTimeout(ctx,
repairPerPropertyTimeout)` (10s). A single slow property no longer monopolizes
the per-node sync budget; the existing per-entry error handling cleanly skips a
timed-out property and continues the loop/stream.
- New metric: `total_coalesced{group}`.
#### Tests
- `banyand/property/gossip/service_test.go`: 5 new ginkgo cases covering
latest-wins reach-Rev, TTL-expired takeover (empty + non-nil pending),
drain-all per wake-up under saturated `groupNotify`, and `CloseNotify` honored
during drain.
- `banyand/property/db/repair_gossip_test.go`: 1 new case verifying a single
property repair timeout does not abort the round.
- All existing tests pass under `-race`.
---
- [x] If this pull request closes/resolves/fixes an existing issue, replace
the issue number. Fixes apache/skywalking#13852.
- [x] Update the [`CHANGES`
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]