Re: [PR] fix(property): prevent gossip propagation from getting stuck after replica scale-up [skywalking-banyandb]

via GitHub Tue, 12 May 2026 07:12:31 -0700


Copilot commented on code in PR #1127:
URL: 
https://github.com/apache/skywalking-banyandb/pull/1127#discussion_r3227096687



##########
banyand/property/db/repair_gossip.go:
##########
@@ -307,13 +357,15 @@ func (r *repairGossipClient) Rev(ctx context.Context, 
tracer gossip.Trace, nextN
                        syncSpan := tracer.CreateSpan(startSyncSpan, "repair 
property")
                        syncSpan.Tag(gossip.TraceTagOperateType, 
gossip.TraceTagOperateRepairProperty)
                        syncSpan.Tag(gossip.TraceTagPropertyID, 
string(sync.Property.Id))
-                       updated, newer, err := syncShard.repair(ctx, 
sync.Property.Id, sync.Property.Property, sync.Property.DeleteTime)
+                       updated, newer, err := r.executeRepairWithBudget(ctx, 
syncShard, sync.Property.Id, sync.Property.Property, sync.Property.DeleteTime, 
request.Group)
                        syncSpan.Tag("updated", fmt.Sprintf("%t", updated))
                        syncSpan.Tag("has_newer", fmt.Sprintf("%t", newer != 
nil))
                        if err != nil {

Review Comment:
   In the PropertySync handling block, `syncSpan.End()` isn’t guaranteed to 
run. If `updated == false` and either `newer == nil` or `sync.From == 
...MISSING`, the code falls through without ending the span, which can leak 
spans and distort tracing. Consider deferring `syncSpan.End()` immediately 
after creation (or ensuring it’s called on every exit path of this case).



##########
banyand/property/gossip/server.go:
##########
@@ -328,23 +336,31 @@ func (q *protocolHandler) addToProcess(request 
*propertyv1.PropagationRequest, t
        // if the latest round is out of ttl, then needs to change to current 
node to executing
        if time.Since(groupShard.latestTime) > q.s.scheduleInterval/2 {
                groupShard.originalNodeID = request.Context.OriginNode
-               select {
-               case groupShard.channel <- handlingRequestData:
-                       q.notifyNewRequest()
-               default:
-                       q.s.log.Error().Msgf("ready to added propagation into 
group shard %s(%d) in a new round, but it's full", request.Group, 
request.ShardId)
+               if groupShard.pending != nil {
+                       q.s.serverMetrics.totalCoalesced.Inc(1, request.Group)
+                       q.s.log.Info().
+                               Str("group", request.Group).
+                               Uint32("shardID", request.ShardId).
+                               Str("originNode", request.Context.OriginNode).
+                               Msg("propagation request coalesced into pending 
(TTL takeover)")
                }
+               groupShard.pending = handlingRequestData
+               q.notifyNewRequest()
                return true

Review Comment:
   `groupShard.latestTime` is only initialized on first creation and never 
updated when new requests are accepted into `pending`. This makes the TTL check 
`time.Since(groupShard.latestTime) > q.s.scheduleInterval/2` use a stale 
timestamp, so a different originator can be accepted immediately after a recent 
request (because the TTL condition stays true forever after the first expiry). 
Update `latestTime` whenever you enqueue/overwrite `pending` so the TTL window 
is based on the most recent accepted round/activity.



##########
test/property_repair/shared_utils.go:
##########
@@ -334,6 +354,106 @@ func parseRepairSuccessCount(content string) int64 {
        return totalCount
 }
 
+// parseRepairFailureCount parses the repair_failure_count from prometheus 
metrics text.
+func parseRepairFailureCount(content string) int64 {
+       // Look for metric lines like: 
banyandb_property_scheduler_property_repair_failure_count{group="perf-test-group",shard="0"}
 5
+       re := 
regexp.MustCompile(`banyandb_property_scheduler_property_repair_failure_count\{[^}]+\}\s+(\d+(?:\.\d+)?)`)
+       matches := re.FindAllStringSubmatch(content, -1)
+
+       var totalCount int64
+       for _, match := range matches {
+               if len(match) >= 2 {

Review Comment:
   The Prometheus value regexes (e.g., `\d+(?:\.\d+)?`) won’t match valid 
exposition formats like scientific notation (e.g., `1.23e+06`) or `+Inf` 
(histograms). This can cause these parsers to silently return 0 and make the 
integration/perf checks flaky. Consider using a more complete Prometheus float 
pattern (optionally including exponent and `Inf`/`NaN`) and reusing it across 
all `parse*` helpers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(property): prevent gossip propagation from getting stuck after replica scale-up [skywalking-banyandb]

Reply via email to