hanahmily opened a new issue, #13862:
URL: https://github.com/apache/skywalking/issues/13862
Spun off from #13861. That issue reports a `panic: offset must be equal to
bytesRead` from the trace `mergeLoop`, restarting the BanyanDB pod every ~8
minutes in production. While investigating I found that the panic is
reproducible from a torn `spans.bin` (or any per-tag `*.t`/`*.tm`) at a block
boundary, and that the BanyanDB merge write path has a durability gap that
allows such torn parts to be left on disk and survive across process restarts.
Filing as a separate issue because the fix lives in the BanyanDB storage
layer (it's not a trace-only or merge-only bug — same gap exists across
`trace`, `measure`, `stream`, and `sidx`); the user-facing symptom in #13861
will go away once it lands.
## Reproduction
A standalone unit test forges the on-disk shape ("`spans.bin` truncated at a
block boundary, `metadata.json` and `primary.bin` intact") and feeds it back
into `tst.mergeParts`. The first read of the truncated block silently
EOF-returns from `seqReader.mustReadFull` without advancing `bytesRead`; the
*following* block's offset check then fires with the exact production message:
```
panic: offset 5300 must be equal to bytesRead 5247
```
Same shape as apache/skywalking#13861's `offset 1400877 must be equal to
bytesRead 1400490`. Reproducer: `Test_merger_tornWriteRepro` in
`banyand/trace/merger_repro_test.go` (ready to upstream once this issue is
accepted).
## Audit of the merge durability path
`banyand/trace/merger.go::mergeParts` (lines 469-525), with cross-references:
| Step | What happens | fsync? |
|---|---|---|
| `bw.writers.MustClose()` | Closes SeqWriters for `meta.bin`,
`primary.bin`, `spans.bin`, plus each `*.t` / `*.tm` | **No.**
`seqWriter.Close()` (`pkg/fs/local_file_system.go:656-669`) does
`bufio.Flush()` then `_ = SyncAndDropCache(...)`. The `SyncAndDropCache` return
value is discarded. On Linux that is `sync_file_range +
posix_fadvise(DONTNEED)`, which is *not* an `fsync` substitute (no
inode-metadata sync, no post-crash guarantee for dirty pages). The underlying
`*os.File` is also never closed in this path. |
| `pm.mustWriteMetadata(...)` | Writes `metadata.json` | **No.**
`localFileSystem.Write` (`pkg/fs/local_file_system.go:201-233`) opens, writes,
defers `file.Close()` — plain `os.File.Close()`, not `LocalFile.Close()`, so no
`Sync()`. |
| `tf.mustWriteTraceIDFilter(...)` | Writes `traceID.filter` | **No.** Same
path. |
| `tt.mustWriteTagType(...)` | Writes `tag.type` | **No.** Same path. |
| `fileSystem.SyncPath(dstPath)` | `fsync(dir)` | Yes — but a directory
fsync persists *names*, not *contents*. |
Failure window: between any data-file `sync_file_range` returning and the
directory fsync completing, a hard kill (SIGKILL, OOM, eviction, kernel panic)
leaves the part dir with `metadata.json`/`primary.bin` claiming bytes that
`spans.bin` (or any tag file) doesn't actually contain. On startup,
`mustOpenFilePart` reads `metadata.json` and trusts it; the next merge that
touches the part panics.
## Why the panic perpetuates
Because the panic in the merge goroutine is unrecovered, the process exits,
k8s restarts the pod, the merge loop wakes up, picks up the same torn part,
panics again — every ~8 minutes in the original report.
## Scope
The same `seqReader.mustReadFull` silent-EOF and the same
`seqWriter.Close()` no-fsync path are used by all four storage engines that
share this merge infrastructure:
- `banyand/trace/part_iter.go:359` (this report)
- `banyand/trace/block.go:196,329`
- `banyand/internal/sidx/block.go`
- `banyand/measure/block.go`
- `banyand/stream/block.go`
One fix to `pkg/fs/local_file_system.go` covers all of them.
## Proposed fix (four layered changes, each independently correct)
1. **`seqWriter.Close()` must `fsync`, not `sync_file_range`, and propagate
the error.** `pkg/fs/local_file_system.go:656-669` — replace `_ =
SyncAndDropCache(...)` with a real `file.Sync()` call whose error fails the
close. Also explicitly close the underlying `*os.File`.
2. **`localFileSystem.Write` must `fsync` before `Close`.**
`pkg/fs/local_file_system.go:201-233` — `metadata.json`, `traceID.filter`,
`tag.type` are atomic state files and cannot race with their sibling data files
for durability.
3. **Two-phase commit for merge metadata.** In `mergeParts` (and the
matching paths in `measure`, `stream`, `sidx`):
```
bw.writers.MustClose() // → fsync each data file
(after fix #1)
write metadata.json.tmp → fsync
write traceID.filter, tag.type → fsync
rename metadata.json.tmp → metadata.json // atomic
SyncPath(dstPath)
```
On startup, parts that have `metadata.json.tmp` *or* `metadata.json`
missing should be discarded as torn-merge artifacts (move to `corrupt/` and
emit a metric).
4. **Convert the six `offset must be equal to bytesRead` panics to errors**
that propagate up to `mergeBlocks`/`mergeParts`. `mergeParts` already knows how
to back out a partial output via `removeTracePartOnFailure` /
`removeSidxPartOnFailure`; the part can then be quarantined. Pair this with a
`merger_corrupt_part_total` metric and loud logging so the operator can act.
This is the same shape as option (3) in apache/skywalking#13861.
Together, (1)+(2)+(3) eliminate the torn-write window; (4) makes the merger
fail-safe even if a torn part somehow survives (e.g., a host-level disk
failure, an upgrade rollback, manual file copy).
## Caveat
The reproducer matches the panic shape exactly and the audit confirms the
durability gap exists, but I have not yet identified what triggers the *very
first* torn part on a fresh disk in apache/skywalking#13861's setup (the
reporter sees the first panic ~30 minutes after a clean wipe, with no prior
process kill). Two diagnostic questions are pending in the original issue.
Either way, the proposed fix is correct on its own merits — it eliminates the
catastrophic perpetuation regardless of what causes the initial tear.
## Reproducer
Will land alongside the fix as
`banyand/trace/merger_repro_test.go::Test_merger_tornWriteRepro`. Five tests in
total in that file; four pass cleanly (rule out
peek/concurrent/heterogeneous-tag hypotheses), one reproduces this issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail:
[email protected]
For queries about this service, please contact Infrastructure at:
[email protected]