hanahmily opened a new pull request, #1124:
URL: https://github.com/apache/skywalking-banyandb/pull/1124

   ### Fix incompatible-storage-version silent boot
   
   - [x] Add a unit test to verify that the fix works.
   - [x] Explain briefly why the bug exists and how to fix it.
   
   #### Why the bug exists
   
   Booting the server against a data directory whose segment files were stamped 
with an unsupported storage version reached \`SERVING\` with the affected 
groups silently un-loaded:
   
   - The version check in \`banyand/internal/storage/version.go\` correctly 
panicked with \`incompatible version 1.3.0\`.
   - The property schema-registry retry loop in 
\`banyand/metadata/schema/property/client.go\` recovered every panic and 
treated it as transient, retrying for 60s.
   - \`pkg/schema/cache.go\`'s \`(*group).initBySchema\` stored \`groupSchema\` 
**before** \`OpenDB\` ran, so a failed first attempt left the cache 
half-initialized; \`pkg/schema/init.go\`'s \`initGroup\` then returned the 
half-broken group on the retry pass without re-running \`initBySchema\`.
   
   Net result: the boot flow declared the handler initialized successfully on 
the second pass; the gRPC and HTTP listeners came up; queries against the 
affected groups returned \`group X not found\`. Empirically reproduced with 
v0.10.2 against v0.9.0-stamped segment data — the server reported \`SERVING\` 
and queries failed silently.
   
   A symmetric runtime hazard existed: when an admin registered a group at 
runtime (etcd watch → \`(*schemaRepo).OnAddOrUpdate\` → \`SendMetadataEvent\` → 
\`processEvent\` → \`storeGroup\` → \`OpenDB\`), the same incompatibility 
produced an error/panic that was silently caught by either the synchronous 
caller (no recover existed there at all) or by the Watcher worker's recover at 
\`pkg/schema/cache.go:240\`, which logged a warning and requeued forever.
   
   #### How the fix works
   
   1. **\`pkg/initerror\`** (new leaf package) exposes a \`Permanent\` error 
contract — interface \`Permanent() bool\`, \`AsPermanent(err)\`, 
\`IsPermanent(err)\`. Stdlib-only imports so storage and the metadata 
schema-registry can communicate without a layering edge.
   2. **\`banyand/internal/storage/version.go\`** wraps 
\`errVersionIncompatible\` with \`initerror.AsPermanent\` at the source so the 
typed error survives the entire propagation path.
   3. **\`pkg/schema/cache.go\`** \`(*group).initBySchema\` is now 
all-or-nothing: schema is stored only after \`OpenDB\` succeeds. Combined with 
**\`pkg/schema/init.go\`**'s \`initGroup\` reconciling a half-broken cached 
group, a failed init can no longer pose as a loaded group on retry.
   4. **\`pkg/schema/init.go\`** \`processGroup\` panics with a wrapped 
\`error\` (typed) instead of \`logger.Panicf\` (string) so the value satisfies 
\`errors.As\` through any \`recover\` site.
   5. **\`banyand/metadata/schema/property/client.go\`** retry classifier reads 
\`initerror.IsPermanent\` and panics with a typed wrapped error instead of 
looping for 60s. A \`timestamp.Clock\` is injected so tests can drive the 
deadline deterministically.
   6. **F5 — runtime path symmetry** (\`pkg/schema/cache.go\`):
      - Watcher worker recover at line 240 classifies \`initerror.IsPermanent\` 
and calls \`r.l.Fatal()\` (\`os.Exit(1)\`); \`panic()\` would only kill the 
worker goroutine.
      - Same classifier on Watcher's \`processEvent\` error-return path before 
requeue.
      - New \`recover()\` added inside \`SendMetadataEvent\` itself 
(synchronous caller), with the same classifier — strict no-regression vs status 
quo where a panic in \`OnAddOrUpdate\` previously crashed the etcd-watch 
goroutine outright.
   7. **F6 (bundled) — \`banyand/internal/storage/tsdb.go\`**: convert two 
string-typed \`logger.Panicf\` sites (lock-create at line 242, lock-delete at 
line 176) to \`panic(fmt.Errorf(\"...: %w\", err))\` so the recover-side 
classifier sees error-typed values; plug a lock-fd leak in \`OpenTSDB\` 
(\`released bool\` + defer-release on any error path between \`CreateLockFile\` 
and the success return). Without this, even a transient \`OpenTSDB\` failure 
self-poisoned the next retry with \`resource temporarily unavailable\`.
   8. **\`Makefile\`** target \`check-import-boundaries\` enforces the layering 
invariants in CI.
   
   #### Tests
   
   - **Unit (\`pkg/initerror/initerror_test.go\`)** — wrap/unwrap/chain 
coverage for the new contract.
   - **Unit (\`pkg/schema/cache_test.go\`)** — 
\`TestInitBySchema_FailedOpenDB_LeavesGroupUninit\`, 
\`TestInitBySchema_PortableGroupOK\`, \`TestInitGroup_RetryAfterFailedOpenDB\` 
lock the all-or-nothing invariant.
   - **Unit (\`banyand/metadata/schema/property/init_handler_test.go\`)** — 
boot-path classifier: permanent panic fails fast under \`MockClock\`, transient 
panic still retries.
   - **Unit (\`pkg/schema/cache_watcher_fatal_test.go\`)** — runtime-path 
classifier via subprocess pattern (\`BANYAND_TEST_F5_FATAL_CHILD\` sentinel) 
for both Watcher worker and \`SendMetadataEvent\`; transient retry still drains 
\`eventCh\`.
   - **Storage integration (\`banyand/internal/storage/tsdb_test.go\`)** — 
\`TestTSDBOpen_RejectsIncompatibleSegment\` builds a real on-disk segment 
skeleton with \`metadata=1.3.0\` via \`newTestSegmentSkeleton\` and asserts the 
wrapped error matches \`IsPermanent\` and 
\`errors.Is(errVersionIncompatible)\`. 
\`TestTSDBOpen_LockReleasedAfterFailedOpen\` is the F6 lock-leak regression 
guard: two-shot open against the same dir, second open must succeed.
   - **End-to-end (\`banyand/cmd/server/main_test.go\`, \`//go:build 
integration\`)** — Phase 1 register \`m1\` group via gRPC, SIGTERM. Phase 2 
plant \`1.3.0\` segment, restart. Server exits non-zero within 10s with 
\`refusing to start\` / \`incompatible version 1.3.0\`. Mirrors the empirical 
reproducer.
   - **End-to-end (\`banyand/cmd/server/main_runtime_test.go\`, \`//go:build 
integration\`)** — pre-stage incompatible segment, register \`m1\` at runtime; 
the synchronous \`SendMetadataEvent\` path catches and exits the process the 
same way.
   
   #### Behavior change
   
   This PR turns a silent \`SERVING\` state into a fast-fail process exit on 
incompatible storage version, on both boot and runtime paths. Operators should 
backup the data dir and downgrade or migrate before retrying.
   
   - [x] Update the [\`CHANGES\` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to