hanahmily commented on code in PR #1115:
URL:
https://github.com/apache/skywalking-banyandb/pull/1115#discussion_r3198267128
##########
banyand/liaison/grpc/deletion.go:
##########
@@ -452,7 +452,7 @@ func (m *groupDeletionTaskManager) getDeletionTask(ctx
context.Context, group st
}
func (m *groupDeletionTaskManager) hasNonEmptyResources(ctx context.Context,
group string) (bool, error) {
- dataInfo, dataErr := m.schemaRegistry.CollectDataInfo(ctx, group)
+ dataInfo, _, dataErr := m.schemaRegistry.CollectDataInfo(ctx, group)
Review Comment:
**(#1) `_` discards `collectionErrors` from `CollectDataInfo` — silent
partial-failure preservation.**
This site (and the matching ones at `deletion.go:161` `startDeletion` and
`liaison/grpc/registry.go:771` `Inspect`) drops the new per-node-error slice
that the rest of this PR plumbs through `CollectDataInfo`. The signature was
changed precisely so callers could distinguish *no resources* from *some nodes
did not answer*, but here we throw that information away and act on a
possibly-incomplete `dataInfo`.
Concrete failure mode for `hasNonEmptyResources`: a group has data on nodes
A, B, C; C is unreachable. Post-fix, `CollectDataInfo` returns `dataInfo =
[A_empty, B_empty]`, `collectionErrors = ["node error: C panic"]`, `err = nil`.
This function then returns `(false, nil)` — *group is empty, safe to proceed* —
but C may still hold data. The same shape applies to `Inspect` returning an
incomplete view, and to `startDeletion` proceeding against stale data.
Pre-PR this was the same behavior (errors were `Warn`-logged and dropped at
the collector), so it is not a regression. But the PR is the change that makes
the truth available; it should be used at the call site rather than discarded.
Suggested fix, in order of conservatism:
1. At minimum, log the discarded errors at this call site so they are not
invisible.
2. Better: if `len(collectionErrors) > 0`, return an error (or a `partial`
flag) so the caller does not act on a partial view — especially in
`hasNonEmptyResources`, where a wrong *empty* answer drives deletion decisions.
The integration test `TestInspectAll_PropagatesPartialFailureErrors` only
exercises the FODC `InspectAll` surface; these three liaison-side callers have
no equivalent test for partial-failure behavior.
##########
banyand/queue/sub/group_lifecycle.go:
##########
@@ -77,12 +82,14 @@ func (s *server) inspectGroup(ctx context.Context, group
*commonv1.Group) *fodcv
Catalog: catalogToString(group.Catalog),
ResourceOpts: group.ResourceOpts,
}
- dataInfo, err := s.metadataRepo.CollectDataInfo(ctx, groupName)
+ dataInfo, collectionErrs, err := s.metadataRepo.CollectDataInfo(ctx,
groupName)
+ info.Errors = collectionErrs
if err != nil {
s.log.Warn().Err(err).Str("group", groupName).Msg("Failed to
collect data info")
- } else {
- info.DataInfo = dataInfo
+ info.Errors = append([]string{topLevelErrorPrefix +
err.Error()}, info.Errors...)
Review Comment:
**(#3, cleanup) The `append`-prepend on the top-level error path is a
no-op.**
The contract on `CollectDataInfo` is: when the `error` return is non-nil,
`collectionErrs` is `nil` (every error path in `metadata/schema/collector.go`
returns `nil, nil, getErr` / `nil, nil, localErr` / `nil, nil,
fmt.Errorf("unsupported catalog…")`). So when we reach this branch,
`info.Errors` was just set to `nil` two lines up, and the prepend collapses to
a single-element slice.
The defensive `append` reads as if it is preserving real per-node errors
that the top-level failure happened on top of, which can't happen given the
contract. A flatter form makes the contract obvious and removes the misleading
hint:
```go
dataInfo, collectionErrs, err := s.metadataRepo.CollectDataInfo(ctx,
groupName)
if err != nil {
s.log.Warn().Err(err).Str("group", groupName).Msg("Failed to collect
data info")
info.Errors = []string{topLevelErrorPrefix + err.Error()}
return info
}
info.Errors = collectionErrs
info.DataInfo = dataInfo
return info
```
If the worry is that the contract on `CollectDataInfo` could change later —
partial errors *plus* a top-level failure — that's worth a one-line comment on
the collector function and an assertion-style comment here, rather than
silently-defensive code.
##########
banyand/queue/sub/group_lifecycle.go:
##########
@@ -32,6 +32,11 @@ import (
// while still bounding load on the data nodes.
const maxInspectGroupConcurrency = 32
+// topLevelErrorPrefix tags an entry in GroupLifecycleInfo.errors that
+// originated from a top-level CollectDataInfo failure (GetGroup, missing
+// collector, dial failure) rather than a per-node broadcast failure.
+const topLevelErrorPrefix = "top-level: "
Review Comment:
**(#4, cleanup) Error-category prefix constants are split across two
packages with overlapping authorship.**
`metadata/schema/collector.go:45-47` declares `broadcastErrorPrefix`,
`futureErrorPrefix`, `nodeErrorPrefix`; this file declares
`topLevelErrorPrefix`. Together they form a single closed set that downstream
consumers (FODC proxy, anyone parsing `GroupLifecycleInfo.errors`) must
categorize via prefix matching. Splitting them across packages means:
- The prefix scheme is documented in three places: the proto comment on
`errors`, this constant block, and the constant block in `collector.go`.
Drifting any one of them silently breaks consumer parsing.
- If a fifth category is added, it's an open question which package owns the
prefix.
- A consumer reading either constant block sees only half the vocabulary.
Lowest-cost fix: hoist all four prefixes (or at minimum re-export the three
from `collector.go`) into a single package — `metadata/schema/collector.go` is
the natural author since it produces three of the four — and have `queue/sub`
import the canonical constants. `topLevelErrorPrefix` is conceptually owned by
*whoever calls `CollectDataInfo` and observes its `error` return*, which is
also the collector boundary, so co-locating is natural.
If you prefer to leave `topLevelErrorPrefix` here, at least add a one-line
comment cross-referencing `collector.go`'s prefix block so the next reader
knows the full vocabulary lives in two files.
The broader concern — stringly-typed categorization at all — is constrained
by the proto's `repeated string errors` shape, so a structured `CollectionError
{ Source ErrorSource; Msg string }` would only help internally. Out of scope
for this PR; centralizing the prefixes is the cheap win.
##########
banyand/internal/storage/index.go:
##########
@@ -384,6 +420,9 @@ func (s *seriesIndex) SearchWithoutSeries(ctx
context.Context, opts IndexSearchO
}
func (s *seriesIndex) Close() error {
+ if s == nil {
+ return nil
+ }
s.metrics.DeleteAll(s.p.SegLabelValues()...)
Review Comment:
**(#2) Nil-receiver guard only protects against the typed-nil interface
leak, not against partially-constructed `seriesIndex`.**
The `if s == nil { return nil }` block above guards the receiver, but the
next line dereferences `s.metrics` and `s.p` without checking either. If the
typed-nil leak from `segment.IndexDB()` is the only path this guard is intended
to defend against, that is fine — but the same pattern is now repeated across
`Insert`, `Update`, `EnableExternalSegments`, `Stats`, `filter`, `Search`,
`SearchWithoutSeries`, and `Close`. Eight near-identical guards have to stay in
sync with whatever fields `seriesIndex` exposes, and a future caller who hits
one of them will reasonably assume the receiver is *fully* nil-safe.
Two cleaner alternatives, in order of scope:
1. **Document intent.** Add a single comment on the `seriesIndex` type
stating that the nil-receiver guards exist solely to absorb the typed-nil leak
from `segment.IndexDB()` and deliberately do not protect against
partially-constructed instances. The eight guards remain but their contract is
explicit.
2. **Remove the underlying ambiguity.** Switch `segment.index` to
`atomic.Pointer[seriesIndex]`. `IndexDB()` becomes `if idx := s.index.Load();
idx != nil { return idx }; return nil` — formally race-free with respect to
`performCleanup` (which the current plain-pointer read is not, even though the
ref-counting contract saves it in practice), and a typed-nil can never escape.
Every `if s == nil` guard added in this PR can then be deleted. Bigger blast
radius — every read/write of `s.index` in `initialize`, `performCleanup`,
`collectMetrics`, `Lookup` has to switch — but it eliminates the maintenance
debt rather than papering over it.
Not blocking the merge; either path is fine. Wanted to flag the
inconsistency before the pattern proliferates.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]