Re: [PR] fix: fail fast on incompatible storage version (boot + runtime paths) [skywalking-banyandb]

via GitHub Sat, 09 May 2026 03:35:53 -0700


Copilot commented on code in PR #1124:
URL: 
https://github.com/apache/skywalking-banyandb/pull/1124#discussion_r3212996739



##########
pkg/schema/cache.go:
##########
@@ -108,7 +109,29 @@ func (sr *schemaRepo) SendMetadataEvent(event 
MetadataEvent) {
                return
        }
        defer sr.closer.SenderDone()
+       // SendMetadataEvent runs synchronously on the caller goroutine (e.g. an
+       // etcd-watch handler such as (*schemaRepo).OnAddOrUpdate). A panic from
+       // processEvent — string-typed today, error-typed once Step 1b lands —
+       // would otherwise propagate to a caller that has no recover() of its 
own.
+       // The classifier branch escalates permanent errors to .Fatal() so
+       // runtime registration of an incompatible group exits the process the
+       // same way the boot path does; non-permanent panics are logged and
+       // swallowed (no regression vs status quo).
+       defer func() {
+               if recovered := recover(); recovered != nil {
+                       if recErr, isErr := recovered.(error); isErr && 
initerror.IsPermanent(recErr) {
+                               sr.l.Fatal().Err(recErr).Interface("event", 
event).Str("stack", string(debug.Stack())).
+                                       Msg("SendMetadataEvent hit a permanent 
error, refusing to continue")
+                       }
+                       sr.l.Warn().Interface("err", 
recovered).Interface("event", event).Str("stack", string(debug.Stack())).
+                               Msg("recovered from SendMetadataEvent panic")

Review Comment:
   The new SendMetadataEvent defer-recover swallows non-permanent panics (logs 
a warning and then returns) without re-queuing the event. That means a 
transient panic inside processEvent (or below it) will drop the metadata event 
on the floor and leave the in-memory cache permanently stale. Consider 
converting recovered non-permanent panics into an error and feeding it through 
the existing transient-retry path (metrics + enqueue onto eventCh), or 
re-panicking so the caller can handle/restart instead of silently losing the 
update.
   



##########
banyand/internal/storage/tsdb.go:
##########
@@ -239,15 +240,31 @@ func OpenTSDB[T TSTable, O any](ctx context.Context, opts 
TSDBOpts[T, O], cache
        lockPath := filepath.Join(opts.Location, lockFilename)
        lock, err := tsdbLfs.CreateLockFile(lockPath, FilePerm)
        if err != nil {
-               logger.Panicf("cannot create lock file %s: %s", lockPath, err)
+               panic(fmt.Errorf("cannot create lock file %s: %w", lockPath, 
err))
        }
        db.lock = lock
+       // Release the lock fd if any subsequent step in OpenTSDB returns an 
error.
+       // Otherwise the file descriptor leaks until process exit and a retry 
on the
+       // same data dir hits "resource temporarily unavailable" from the next
+       // CreateLockFile call. Ownership transfers to (*database).Close() only 
on
+       // success — released = true is set on the success return below.
+       released := false
+       defer func() {
+               if !released {
+                       _ = lock.Close()
+                       _ = tsdbLfs.DeleteFile(lockPath)
+               }
+       }()
        if err := db.segmentController.open(); err != nil {
                return nil, err
        }
        obsservice.MetricsCollector.Register(location, db.collect)
        db.disableRotation = opts.DisableRotation
-       return db, db.startRotationTask()
+       if err := db.startRotationTask(); err != nil {
+               return nil, err
+       }
+       released = true
+       return db, nil

Review Comment:
   OpenTSDB now returns (nil, err) when startRotationTask() fails, but by that 
point segmentController.open() has succeeded and 
MetricsCollector.Register(location, db.collect) has already been called. 
Because the partially-initialized db is no longer returned to the caller, 
there’s no way for the caller to Close() and unregister/cleanup, so this error 
path can leak resources and leave the metrics collector registered. Consider 
cleaning up (e.g., unregister + close db/segmentController/scheduler) before 
returning the error, or preserve the previous behavior of returning the db 
alongside the error so the caller can close it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix: fail fast on incompatible storage version (boot + runtime paths) [skywalking-banyandb]

Reply via email to