hanahmily opened a new pull request, #1118:
URL: https://github.com/apache/skywalking-banyandb/pull/1118

   ## Summary
   
   Closes the BanyanDB merge write-path durability gap that allowed torn parts 
to be created on disk and survive across process restarts 
(apache/skywalking#13862, root cause for #13861). The merger's response to a 
torn part — `logger.Panicf` — is preserved unchanged; only the conditions under 
which a torn part can exist on disk are tightened.
   
   Three layered, independently-correct fixes plus a reproducer test:
   
   1. **`pkg/fs` durability primitives** — `seqWriter.Close()` now propagates 
the `fdatasync` error (was discarded); `localFileSystem.Write()` now `fsync`s 
before close; new `WriteAtomic` helper does write-tmp → file.Sync → rename → 
dir.Sync, with `MustFlushAtomic` as the panic-on-error wrapper and 
`CleanupLeftoverTmp` as the on-open helper.
   2. **Two-phase commit for every metadata-style write** in trace, measure, 
stream, sidx — `metadata.json` / `manifest.json` / `traceID.filter` / 
`tag.type` / `seriesMetadata.bin` / snapshot files all go through `WriteAtomic` 
(or `MustFlushAtomic` for sidx). Streaming-sync sidecars in 
`trace/write_data.go` also switched.
   3. **Per-engine startup cleanup** — each engine's `mustOpenFilePart` / 
`mustOpenPart` calls `fs.CleanupLeftoverTmp` to remove safe post-rename `.tmp` 
leftovers; sidx `loadPartMetadata` no longer silently falls back to legacy 
`meta.bin` when a stranded `manifest.json.tmp` exists (would have masked the 
incomplete-commit state).
   4. **Reproducer test** — `Test_merger_tornSpansBin_stillPanics` forges the 
production-shape on-disk state and asserts the merger still panics with one of 
the canonical fail-fast messages, guarding the contract against future 
refactors.
   
   ## Why the fix shape
   
   Earlier draft converted the six \`offset must be equal to bytesRead\` panics 
to errors and added a `recover()` boundary in `mergeParts`. Per maintainer 
design feedback, the panic-on-corruption pattern is BanyanDB's deliberate 
fail-fast contract and was preserved. The fix is purely preventive: it prevents 
torn parts from being **created** by a crash mid-merge or by a silent 
`fdatasync` error. Torn parts that arrive on disk via external causes (rsync, 
host disk failure, pre-fix-deployment leftovers) still trigger the canonical 
panic.
   
   Two reviewer passes (architect critic + GitHub Copilot CLI) caught 
additional issues that were addressed inline:
   - `WriteAtomic` no longer removes the `.tmp` on rename failure (forensic 
value).
   - 11 redundant `SyncPath` calls removed (each `WriteAtomic` already does 
`dir.Sync`); trace was paying up to 4 dir fsyncs per finalize, now 1.
   - `seriesMetadata.bin` and snapshot files were missed in the original audit; 
both now atomic.
   
   ## Test plan
   
   - [x] \`make license-check\` / \`check-req\` / \`build\` / \`lint\` / 
\`check\` — green
   - [x] Unit tests: \`./banyand/...\`, \`./bydbctl/...\`, \`./pkg/...\`, 
\`./fodc/...\` — green
   - [x] Integration tests: \`./test/integration/standalone/...\`, 
\`./test/integration/distributed/...\` — green
   - [x] Targeted package suites: pkg/fs (21/21 specs, +5 new), trace 22.1s, 
measure 29.3s, stream 17.4s, sidx 3.4s
   - [x] New per-engine \`merger_durability_test.go\` files (4 tests × 4 
engines, 11 cases total) verify atomic metadata commit + leftover-tmp cleanup
   - [x] \`Test_merger_tornSpansBin_stillPanics\` (trace) actively guards the 
canonical panic-on-torn-part contract; captured panic shape: \`cannot read 
data: unexpected EOF\`
   - [ ] e2e (FODC, OAP) — skipped locally; require Kubernetes/Kind. Will run 
via CI.
   
   ## Out of scope
   
   - No conversion of `logger.Panicf("offset N must be equal to bytesRead M", 
...)` sites to error returns.
   - No `recover()` boundary in `mergeParts`. Process exit on torn data is 
preserved.
   - No quarantine of torn parts to a `corrupt/` subdirectory. Operator handles 
via existing snapshot-based orphan cleanup at startup.
   - No new metric. Pod restart count from kubelet remains the operator signal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to