Re: [PR] feat(lifecycle): throttle migration under receiver memory pressure; fix bluge dedup OOM [skywalking-banyandb]

via GitHub Fri, 26 Jun 2026 00:47:45 -0700


hanahmily commented on code in PR #1193:
URL: 
https://github.com/apache/skywalking-banyandb/pull/1193#discussion_r3479914429



##########
banyand/backup/lifecycle/retry.go:
##########
@@ -0,0 +1,133 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package lifecycle
+
+import (
+       "errors"
+       "fmt"
+       "strings"
+       "time"
+
+       "github.com/cenkalti/backoff/v4"
+
+       "github.com/apache/skywalking-banyandb/banyand/queue"
+       "github.com/apache/skywalking-banyandb/pkg/grpchelper"
+       "github.com/apache/skywalking-banyandb/pkg/logger"
+       "github.com/apache/skywalking-banyandb/pkg/node"
+)
+
+// defaultLifecycleSendRetryTimeout bounds the total wall-clock spent retrying
+// transient failures when streaming a part to a target node (e.g. while the
+// target node restarts). Configurable via --lifecycle-send-retry-timeout.
+const defaultLifecycleSendRetryTimeout = 15 * time.Minute
+
+// lifecycleSendRetryTimeout is the active bound, set by the lifecycle flag.
+var lifecycleSendRetryTimeout = defaultLifecycleSendRetryTimeout
+
+// isNoNodes reports whether err means the target node could not be picked
+// because no node is currently available (e.g. the target is restarting).
+// It matches both the typed sentinel and the plain round-robin error string.
+func isNoNodes(err error) bool {
+       if errors.Is(err, node.ErrNoAvailableNode) {
+               return true
+       }
+       return err != nil && strings.Contains(err.Error(), "no nodes available")
+}
+
+// isTransientSend reports whether err from streaming a part to a node is
+// transient and worth retrying: receiver back-pressure (SERVER_BUSY) or a
+// transient/failover gRPC error such as Unavailable, ResourceExhausted, or a
+// dropped connection.
+func isTransientSend(err error) bool {
+       return errors.Is(err, queue.ErrServerBusy) || 
grpchelper.IsTransientError(err) || grpchelper.IsFailoverError(err)
+}
+
+// retryLogInterval throttles the "still retrying" heartbeat so a long wait for
+// a restarting node logs at most once per minute instead of on every attempt.
+const retryLogInterval = time.Minute
+
+// nodePicker is the one method of node.Selector that the send-retry path 
needs;
+// narrowing it keeps runWithRetry/pickWithRetry decoupled and easy to test.
+type nodePicker interface {
+       Pick(group, name string, shardID, replicaID uint32) (string, error)
+}
+
+// runWithRetry runs op under bounded exponential backoff capped by
+// lifecycleSendRetryTimeout. op returns a transient error to retry or
+// backoff.Permanent(err) to stop immediately. While retrying it logs a
+// heartbeat at most once per minute so a long wait for a restarting node is 
not
+// mistaken for a hang.
+func runWithRetry(l *logger.Logger, stage string, op func() error) error {
+       bo := backoff.NewExponentialBackOff()
+       bo.MaxElapsedTime = lifecycleSendRetryTimeout
+       start := time.Now()
+       var lastLog time.Time
+       return backoff.RetryNotify(op, bo, func(err error, _ time.Duration) {

Review Comment:
   The send-retry loop ignores context cancellation. `backoff.RetryNotify(op, 
bo, ...)` has no context, and the op it ultimately drives (`streamPartToNode` 
-> `SyncStreamingParts`) is called with `context.Background()`. So once a part 
starts retrying, a migration that needs to abort (lifecycle service shutdown, 
operator cancel) stays stuck here for up to `lifecycle-send-retry-timeout` (15m 
default).
   
   Note the asymmetry with the receive side: `protector.WaitWhileHigh` already 
honors its `ctx`, but this send path does not.
   
   Request: thread a cancellable context through and use 
`backoff.RetryNotify(op, backoff.WithContext(bo, ctx), ...)`. The row-replay 
callers (`routeColumnar`, `routeAndEnqueue`) already have a `ctx` in scope to 
pass down; for the file-chunk visitor paths a migration-scoped context would 
also let `streamPartToNode` stop streaming on cancel instead of relying solely 
on the 15m cap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(lifecycle): throttle migration under receiver memory pressure; fix bluge dedup OOM [skywalking-banyandb]

Reply via email to