hanahmily opened a new issue, #13862:
URL: https://github.com/apache/skywalking/issues/13862

   Spun off from #13861. That issue reports a `panic: offset must be equal to 
bytesRead` from the trace `mergeLoop`, restarting the BanyanDB pod every ~8 
minutes in production. While investigating I found that the panic is 
reproducible from a torn `spans.bin` (or any per-tag `*.t`/`*.tm`) at a block 
boundary, and that the BanyanDB merge write path has a durability gap that 
allows such torn parts to be left on disk and survive across process restarts.
   
   Filing as a separate issue because the fix lives in the BanyanDB storage 
layer (it's not a trace-only or merge-only bug — same gap exists across 
`trace`, `measure`, `stream`, and `sidx`); the user-facing symptom in #13861 
will go away once it lands.
   
   ## Reproduction
   
   A standalone unit test forges the on-disk shape ("`spans.bin` truncated at a 
block boundary, `metadata.json` and `primary.bin` intact") and feeds it back 
into `tst.mergeParts`. The first read of the truncated block silently 
EOF-returns from `seqReader.mustReadFull` without advancing `bytesRead`; the 
*following* block's offset check then fires with the exact production message:
   
   ```
   panic: offset 5300 must be equal to bytesRead 5247
   ```
   
   Same shape as apache/skywalking#13861's `offset 1400877 must be equal to 
bytesRead 1400490`. Reproducer: `Test_merger_tornWriteRepro` in 
`banyand/trace/merger_repro_test.go` (ready to upstream once this issue is 
accepted).
   
   ## Audit of the merge durability path
   
   `banyand/trace/merger.go::mergeParts` (lines 469-525), with cross-references:
   
   | Step | What happens | fsync? |
   |---|---|---|
   | `bw.writers.MustClose()` | Closes SeqWriters for `meta.bin`, 
`primary.bin`, `spans.bin`, plus each `*.t` / `*.tm` | **No.** 
`seqWriter.Close()` (`pkg/fs/local_file_system.go:656-669`) does 
`bufio.Flush()` then `_ = SyncAndDropCache(...)`. The `SyncAndDropCache` return 
value is discarded. On Linux that is `sync_file_range + 
posix_fadvise(DONTNEED)`, which is *not* an `fsync` substitute (no 
inode-metadata sync, no post-crash guarantee for dirty pages). The underlying 
`*os.File` is also never closed in this path. |
   | `pm.mustWriteMetadata(...)` | Writes `metadata.json` | **No.** 
`localFileSystem.Write` (`pkg/fs/local_file_system.go:201-233`) opens, writes, 
defers `file.Close()` — plain `os.File.Close()`, not `LocalFile.Close()`, so no 
`Sync()`. |
   | `tf.mustWriteTraceIDFilter(...)` | Writes `traceID.filter` | **No.** Same 
path. |
   | `tt.mustWriteTagType(...)` | Writes `tag.type` | **No.** Same path. |
   | `fileSystem.SyncPath(dstPath)` | `fsync(dir)` | Yes — but a directory 
fsync persists *names*, not *contents*. |
   
   Failure window: between any data-file `sync_file_range` returning and the 
directory fsync completing, a hard kill (SIGKILL, OOM, eviction, kernel panic) 
leaves the part dir with `metadata.json`/`primary.bin` claiming bytes that 
`spans.bin` (or any tag file) doesn't actually contain. On startup, 
`mustOpenFilePart` reads `metadata.json` and trusts it; the next merge that 
touches the part panics.
   
   ## Why the panic perpetuates
   
   Because the panic in the merge goroutine is unrecovered, the process exits, 
k8s restarts the pod, the merge loop wakes up, picks up the same torn part, 
panics again — every ~8 minutes in the original report.
   
   ## Scope
   
   The same `seqReader.mustReadFull` silent-EOF and the same 
`seqWriter.Close()` no-fsync path are used by all four storage engines that 
share this merge infrastructure:
   
   - `banyand/trace/part_iter.go:359` (this report)
   - `banyand/trace/block.go:196,329`
   - `banyand/internal/sidx/block.go`
   - `banyand/measure/block.go`
   - `banyand/stream/block.go`
   
   One fix to `pkg/fs/local_file_system.go` covers all of them.
   
   ## Proposed fix (four layered changes, each independently correct)
   
   1. **`seqWriter.Close()` must `fsync`, not `sync_file_range`, and propagate 
the error.** `pkg/fs/local_file_system.go:656-669` — replace `_ = 
SyncAndDropCache(...)` with a real `file.Sync()` call whose error fails the 
close. Also explicitly close the underlying `*os.File`.
   2. **`localFileSystem.Write` must `fsync` before `Close`.** 
`pkg/fs/local_file_system.go:201-233` — `metadata.json`, `traceID.filter`, 
`tag.type` are atomic state files and cannot race with their sibling data files 
for durability.
   3. **Two-phase commit for merge metadata.** In `mergeParts` (and the 
matching paths in `measure`, `stream`, `sidx`):
      ```
      bw.writers.MustClose()                    // → fsync each data file 
(after fix #1)
      write metadata.json.tmp → fsync
      write traceID.filter, tag.type → fsync
      rename metadata.json.tmp → metadata.json  // atomic
      SyncPath(dstPath)
      ```
      On startup, parts that have `metadata.json.tmp` *or* `metadata.json` 
missing should be discarded as torn-merge artifacts (move to `corrupt/` and 
emit a metric).
   4. **Convert the six `offset must be equal to bytesRead` panics to errors** 
that propagate up to `mergeBlocks`/`mergeParts`. `mergeParts` already knows how 
to back out a partial output via `removeTracePartOnFailure` / 
`removeSidxPartOnFailure`; the part can then be quarantined. Pair this with a 
`merger_corrupt_part_total` metric and loud logging so the operator can act. 
This is the same shape as option (3) in apache/skywalking#13861.
   
   Together, (1)+(2)+(3) eliminate the torn-write window; (4) makes the merger 
fail-safe even if a torn part somehow survives (e.g., a host-level disk 
failure, an upgrade rollback, manual file copy).
   
   ## Caveat
   
   The reproducer matches the panic shape exactly and the audit confirms the 
durability gap exists, but I have not yet identified what triggers the *very 
first* torn part on a fresh disk in apache/skywalking#13861's setup (the 
reporter sees the first panic ~30 minutes after a clean wipe, with no prior 
process kill). Two diagnostic questions are pending in the original issue. 
Either way, the proposed fix is correct on its own merits — it eliminates the 
catastrophic perpetuation regardless of what causes the initial tear.
   
   ## Reproducer
   
   Will land alongside the fix as 
`banyand/trace/merger_repro_test.go::Test_merger_tornWriteRepro`. Five tests in 
total in that file; four pass cleanly (rule out 
peek/concurrent/heterogeneous-tag hypotheses), one reproduces this issue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to