Re: [PR] Fix fetch segment nil pointer in cold stage and propagation the error when inspect the group [skywalking-banyandb]

via GitHub Wed, 06 May 2026 17:28:54 -0700


hanahmily commented on code in PR #1115:
URL: 
https://github.com/apache/skywalking-banyandb/pull/1115#discussion_r3198267128



##########
banyand/liaison/grpc/deletion.go:
##########
@@ -452,7 +452,7 @@ func (m *groupDeletionTaskManager) getDeletionTask(ctx 
context.Context, group st
 }
 
 func (m *groupDeletionTaskManager) hasNonEmptyResources(ctx context.Context, 
group string) (bool, error) {
-       dataInfo, dataErr := m.schemaRegistry.CollectDataInfo(ctx, group)
+       dataInfo, _, dataErr := m.schemaRegistry.CollectDataInfo(ctx, group)

Review Comment:
   **(#1) `_` discards `collectionErrors` from `CollectDataInfo` — silent 
partial-failure preservation.**
   
   This site (and the matching ones at `deletion.go:161` `startDeletion` and 
`liaison/grpc/registry.go:771` `Inspect`) drops the new per-node-error slice 
that the rest of this PR plumbs through `CollectDataInfo`. The signature was 
changed precisely so callers could distinguish *no resources* from *some nodes 
did not answer*, but here we throw that information away and act on a 
possibly-incomplete `dataInfo`.
   
   Concrete failure mode for `hasNonEmptyResources`: a group has data on nodes 
A, B, C; C is unreachable. Post-fix, `CollectDataInfo` returns `dataInfo = 
[A_empty, B_empty]`, `collectionErrors = ["node error: C panic"]`, `err = nil`. 
This function then returns `(false, nil)` — *group is empty, safe to proceed* — 
but C may still hold data. The same shape applies to `Inspect` returning an 
incomplete view, and to `startDeletion` proceeding against stale data.
   
   Pre-PR this was the same behavior (errors were `Warn`-logged and dropped at 
the collector), so it is not a regression. But the PR is the change that makes 
the truth available; it should be used at the call site rather than discarded.
   
   Suggested fix, in order of conservatism:
   
   1. At minimum, log the discarded errors at this call site so they are not 
invisible.
   2. Better: if `len(collectionErrors) > 0`, return an error (or a `partial` 
flag) so the caller does not act on a partial view — especially in 
`hasNonEmptyResources`, where a wrong *empty* answer drives deletion decisions.
   
   The integration test `TestInspectAll_PropagatesPartialFailureErrors` only 
exercises the FODC `InspectAll` surface; these three liaison-side callers have 
no equivalent test for partial-failure behavior.



##########
banyand/queue/sub/group_lifecycle.go:
##########
@@ -77,12 +82,14 @@ func (s *server) inspectGroup(ctx context.Context, group 
*commonv1.Group) *fodcv
                Catalog:      catalogToString(group.Catalog),
                ResourceOpts: group.ResourceOpts,
        }
-       dataInfo, err := s.metadataRepo.CollectDataInfo(ctx, groupName)
+       dataInfo, collectionErrs, err := s.metadataRepo.CollectDataInfo(ctx, 
groupName)
+       info.Errors = collectionErrs
        if err != nil {
                s.log.Warn().Err(err).Str("group", groupName).Msg("Failed to 
collect data info")
-       } else {
-               info.DataInfo = dataInfo
+               info.Errors = append([]string{topLevelErrorPrefix + 
err.Error()}, info.Errors...)

Review Comment:
   **(#3, cleanup) The `append`-prepend on the top-level error path is a 
no-op.**
   
   The contract on `CollectDataInfo` is: when the `error` return is non-nil, 
`collectionErrs` is `nil` (every error path in `metadata/schema/collector.go` 
returns `nil, nil, getErr` / `nil, nil, localErr` / `nil, nil, 
fmt.Errorf("unsupported catalog…")`). So when we reach this branch, 
`info.Errors` was just set to `nil` two lines up, and the prepend collapses to 
a single-element slice.
   
   The defensive `append` reads as if it is preserving real per-node errors 
that the top-level failure happened on top of, which can't happen given the 
contract. A flatter form makes the contract obvious and removes the misleading 
hint:
   
   ```go
   dataInfo, collectionErrs, err := s.metadataRepo.CollectDataInfo(ctx, 
groupName)
   if err != nil {
       s.log.Warn().Err(err).Str("group", groupName).Msg("Failed to collect 
data info")
       info.Errors = []string{topLevelErrorPrefix + err.Error()}
       return info
   }
   info.Errors = collectionErrs
   info.DataInfo = dataInfo
   return info
   ```
   
   If the worry is that the contract on `CollectDataInfo` could change later — 
partial errors *plus* a top-level failure — that's worth a one-line comment on 
the collector function and an assertion-style comment here, rather than 
silently-defensive code.



##########
banyand/queue/sub/group_lifecycle.go:
##########
@@ -32,6 +32,11 @@ import (
 // while still bounding load on the data nodes.
 const maxInspectGroupConcurrency = 32
 
+// topLevelErrorPrefix tags an entry in GroupLifecycleInfo.errors that
+// originated from a top-level CollectDataInfo failure (GetGroup, missing
+// collector, dial failure) rather than a per-node broadcast failure.
+const topLevelErrorPrefix = "top-level: "

Review Comment:
   **(#4, cleanup) Error-category prefix constants are split across two 
packages with overlapping authorship.**
   
   `metadata/schema/collector.go:45-47` declares `broadcastErrorPrefix`, 
`futureErrorPrefix`, `nodeErrorPrefix`; this file declares 
`topLevelErrorPrefix`. Together they form a single closed set that downstream 
consumers (FODC proxy, anyone parsing `GroupLifecycleInfo.errors`) must 
categorize via prefix matching. Splitting them across packages means:
   
   - The prefix scheme is documented in three places: the proto comment on 
`errors`, this constant block, and the constant block in `collector.go`. 
Drifting any one of them silently breaks consumer parsing.
   - If a fifth category is added, it's an open question which package owns the 
prefix.
   - A consumer reading either constant block sees only half the vocabulary.
   
   Lowest-cost fix: hoist all four prefixes (or at minimum re-export the three 
from `collector.go`) into a single package — `metadata/schema/collector.go` is 
the natural author since it produces three of the four — and have `queue/sub` 
import the canonical constants. `topLevelErrorPrefix` is conceptually owned by 
*whoever calls `CollectDataInfo` and observes its `error` return*, which is 
also the collector boundary, so co-locating is natural.
   
   If you prefer to leave `topLevelErrorPrefix` here, at least add a one-line 
comment cross-referencing `collector.go`'s prefix block so the next reader 
knows the full vocabulary lives in two files.
   
   The broader concern — stringly-typed categorization at all — is constrained 
by the proto's `repeated string errors` shape, so a structured `CollectionError 
{ Source ErrorSource; Msg string }` would only help internally. Out of scope 
for this PR; centralizing the prefixes is the cheap win.



##########
banyand/internal/storage/index.go:
##########
@@ -384,6 +420,9 @@ func (s *seriesIndex) SearchWithoutSeries(ctx 
context.Context, opts IndexSearchO
 }
 
 func (s *seriesIndex) Close() error {
+       if s == nil {
+               return nil
+       }
        s.metrics.DeleteAll(s.p.SegLabelValues()...)

Review Comment:
   **(#2) Nil-receiver guard only protects against the typed-nil interface 
leak, not against partially-constructed `seriesIndex`.**
   
   The `if s == nil { return nil }` block above guards the receiver, but the 
next line dereferences `s.metrics` and `s.p` without checking either. If the 
typed-nil leak from `segment.IndexDB()` is the only path this guard is intended 
to defend against, that is fine — but the same pattern is now repeated across 
`Insert`, `Update`, `EnableExternalSegments`, `Stats`, `filter`, `Search`, 
`SearchWithoutSeries`, and `Close`. Eight near-identical guards have to stay in 
sync with whatever fields `seriesIndex` exposes, and a future caller who hits 
one of them will reasonably assume the receiver is *fully* nil-safe.
   
   Two cleaner alternatives, in order of scope:
   
   1. **Document intent.** Add a single comment on the `seriesIndex` type 
stating that the nil-receiver guards exist solely to absorb the typed-nil leak 
from `segment.IndexDB()` and deliberately do not protect against 
partially-constructed instances. The eight guards remain but their contract is 
explicit.
   2. **Remove the underlying ambiguity.** Switch `segment.index` to 
`atomic.Pointer[seriesIndex]`. `IndexDB()` becomes `if idx := s.index.Load(); 
idx != nil { return idx }; return nil` — formally race-free with respect to 
`performCleanup` (which the current plain-pointer read is not, even though the 
ref-counting contract saves it in practice), and a typed-nil can never escape. 
Every `if s == nil` guard added in this PR can then be deleted. Bigger blast 
radius — every read/write of `s.index` in `initialize`, `performCleanup`, 
`collectMetrics`, `Lookup` has to switch — but it eliminates the maintenance 
debt rather than papering over it.
   
   Not blocking the merge; either path is fine. Wanted to flag the 
inconsistency before the pattern proliferates.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Fix fetch segment nil pointer in cold stage and propagation the error when inspect the group [skywalking-banyandb]

Reply via email to