Felix-wave opened a new issue, #13861:
URL: https://github.com/apache/skywalking/issues/13861

   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no 
similar issues. (Different from the related #13860 — that one is in the trace 
**write** path and is recovered by the gRPC interceptor; this one is in the 
trace **merge/read** path and crashes the process.)
   
   ### Apache SkyWalking Component
   
   BanyanDB
   
   ### What happened
   
   After upgrading from `apache/skywalking-banyandb:0.9.0` to `0.10.1` (with 
OAP 10.4.0), BanyanDB **crashes the process** every ~7-8 minutes with:
   
   ```
   panic: offset 1400877 must be equal to bytesRead 1400490
   ```
   
   Unlike the timestamp-ordering panic in #13860 (which is recovered by 
`grpc-middleware`), this one fires from a **background `mergeLoop` goroutine** 
that is not wrapped by recovery, so the process exits and the pod restarts.
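
To illustrate the difference, here is a minimal Go sketch (not BanyanDB code) of a background goroutine whose panic is contained by a deferred `recover`. The helper name `runRecovered` is hypothetical, and a plain `panic` stands in for `logger.Panicf`, whose exact behavior is assumed here rather than verified.

```go
package main

import (
	"fmt"
)

// runRecovered is a hypothetical helper: it runs fn and converts a panic
// into an error, so a failure inside a background loop does not take down
// the whole process.
func runRecovered(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	done := make(chan error, 1)

	// Without the wrapper, a panic in this goroutine would exit the process.
	go func() {
		done <- runRecovered(func() {
			panic("offset 1400877 must be equal to bytesRead 1400490")
		})
	}()

	// The process survives and the failure is reported as an error value.
	fmt.Println(<-done)
}
```

Recovering only keeps the process alive; whether the merge state is still safe to use afterwards is a separate question for the maintainers.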
   
   #### Full stack
   
   ```
   goroutine 3900 [running]:
   github.com/apache/skywalking-banyandb/pkg/logger.Panicf(...)
   github.com/apache/skywalking-banyandb/banyand/trace.(*partMergeIter).mustReadRaw(0xc001ac4000, 0xc002d716b8, 0xc001ac4118)
       /mnt/d/skywalking-banyandb/banyand/trace/part_iter.go:359 +0xf5
   github.com/apache/skywalking-banyandb/banyand/trace.(*blockReader).mustReadRaw(...)
       /mnt/d/skywalking-banyandb/banyand/trace/block_reader.go:263
   github.com/apache/skywalking-banyandb/banyand/trace.mergeBlocks(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:421 +0x79e
   github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeParts(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:344 +0x42a
   github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergePartsThenSendIntroduction(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:118 +0x145
   github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeSnapshot(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:104 +0x125
   github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeLoop.func1(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:78 +0x1f9
   github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeLoop(...)
       /mnt/d/skywalking-banyandb/banyand/trace/merger.go:90 +0x271
   created by github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).startLoop in goroutine 157
       /mnt/d/skywalking-banyandb/banyand/trace/tstable.go:130 +0x246
   ```
   
   #### Source location (apache/skywalking-banyandb v0.10.1)
   
   `banyand/trace/part_iter.go:354-365`:
   
   ```go
   func (pmi *partMergeIter) mustReadRaw(r *rawBlock, bm *blockMetadata) {
       r.bm = bm
       // spans
       if bm.spans != nil && bm.spans.size > 0 {
           // Validate the reader is aligned to the expected offset
           if bm.spans.offset != pmi.seqReaders.spans.bytesRead {
               logger.Panicf("offset %d must be equal to bytesRead %d", bm.spans.offset, pmi.seqReaders.spans.bytesRead)
           }
           ...
       }
       ...
   }
   ```
   
   So the merger sequentially reads spans from `seqReaders.spans`, and a 
per-block `bm.spans.offset` is expected to match how far the `seqReader` has 
advanced (`bytesRead`). When they diverge — by 387 bytes in our sample — the 
merger panics. The same pattern (`offset must be equal to bytesRead`) appears 
at:
   
   - `banyand/trace/block.go:196` (tag metadata)
   - `banyand/trace/block.go:329` (span data)
   - `banyand/internal/sidx/block.go`
   - `banyand/measure/block.go`
   - `banyand/stream/block.go`
   
   So the same invariant is repeated across several storage packages, including 
the new (0.10) trace storage engine.
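
For readers unfamiliar with the pattern, the invariant can be sketched in isolation as below. This is a simplified model under our own assumptions, not the BanyanDB implementation: `seqReader`, `blockRef`, `newSeqReader`, and `readBlock` are hypothetical stand-ins, and the check returns an error where the real code panics.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// seqReader is a hypothetical stand-in for a sequential reader that counts
// how many bytes it has consumed so far.
type seqReader struct {
	r         io.Reader
	bytesRead uint64
}

// blockRef is hypothetical per-block metadata: where the block is expected
// to start in the file, and how long it is.
type blockRef struct {
	offset, size uint64
}

func newSeqReader(data []byte) *seqReader {
	return &seqReader{r: bytes.NewReader(data)}
}

// readBlock enforces the invariant: the block's declared offset must equal
// the number of bytes already consumed, because the reader cannot seek.
func (sr *seqReader) readBlock(b blockRef) ([]byte, error) {
	if b.offset != sr.bytesRead {
		return nil, fmt.Errorf("offset %d must be equal to bytesRead %d", b.offset, sr.bytesRead)
	}
	buf := make([]byte, b.size)
	n, err := io.ReadFull(sr.r, buf)
	sr.bytesRead += uint64(n)
	if err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	sr := newSeqReader(make([]byte, 64))

	// Contiguous block: offset 0 matches bytesRead 0.
	_, err := sr.readBlock(blockRef{offset: 0, size: 10})
	fmt.Println(err)

	// Metadata claims the next block starts at 13, but only 10 bytes were read.
	_, err = sr.readBlock(blockRef{offset: 13, size: 10})
	fmt.Println(err)
}
```

In this model a mismatch means either the metadata offsets are wrong or earlier reads consumed a different number of bytes than the metadata sizes imply; the sketch cannot tell which one applies to our data.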
   
   ### Cadence and impact
   
   In our cluster, the BanyanDB pod restarted **126 times in 17 hours**, roughly 
once every 8 minutes. Each time, OAP loses its connection to BanyanDB and 
crash-loops as well (~148 OAP restarts in the same window). Net effect: rolling 
outages, with a 1-2 minute window every ~8 minutes in which ingestion 
and queries fail.
   
   For comparison, on 0.9.0 the only panic we saw fired ~once every 28 minutes. 
**0.10.1 is significantly less stable on our workload, primarily because of 
this new panic in the merger.**
   
   ### What you expected to happen
   
   The merger should not panic on what is clearly corrupted or out-of-sync 
block metadata. Reasonable options (maintainers know best):
   
   1. **Skip the offending block** with a warning instead of `Panicf` — at 
minimum, contain the blast radius to one block instead of restarting the whole 
DB.
   2. **Reposition the seqReader** to the offset declared in `bm.spans.offset` (or 
vice versa) when divergence is detected. This assumes the metadata is the source of 
truth.
   3. **Fail the merge of the affected part** but keep the process running and 
let retention/cleanup eventually drop the corrupted part.
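
As a rough illustration of option 3, the sketch below validates each part's block offsets up front and sets a failing part aside instead of panicking. Everything here is hypothetical (`part`, `validate`, `mergeable`, `errOffsetMismatch`); it models blocks as bare offset/size pairs and is not based on BanyanDB's actual data structures.

```go
package main

import (
	"errors"
	"fmt"
)

// errOffsetMismatch is a hypothetical sentinel error for the broken invariant.
var errOffsetMismatch = errors.New("block offset does not match bytes read")

// part is a hypothetical stand-in for an on-disk part: an ID plus a list of
// {offset, size} pairs, one per block, in file order.
type part struct {
	id     int
	blocks [][2]uint64
}

// validate walks the blocks sequentially and returns an error, rather than
// panicking, when a declared offset diverges from the running byte count.
func (p part) validate() error {
	var bytesRead uint64
	for _, b := range p.blocks {
		if b[0] != bytesRead {
			return fmt.Errorf("part %d: %w: offset %d, bytesRead %d",
				p.id, errOffsetMismatch, b[0], bytesRead)
		}
		bytesRead += b[1]
	}
	return nil
}

// mergeable splits parts into those that are safe to merge and those that
// should be set aside, so one bad part does not abort the whole process.
func mergeable(parts []part) (ok, quarantined []int) {
	for _, p := range parts {
		if err := p.validate(); err != nil {
			fmt.Println("warn:", err)
			quarantined = append(quarantined, p.id)
			continue
		}
		ok = append(ok, p.id)
	}
	return ok, quarantined
}

func main() {
	parts := []part{
		{id: 1, blocks: [][2]uint64{{0, 100}, {100, 50}}},
		{id: 2, blocks: [][2]uint64{{0, 100}, {487, 50}}}, // 387-byte gap, as in our sample
		{id: 3, blocks: [][2]uint64{{0, 10}}},
	}
	ok, bad := mergeable(parts)
	fmt.Println("merge:", ok, "quarantine:", bad)
}
```

With this input, parts 1 and 3 remain eligible for merging and part 2 is set aside. A real fix would also need to decide what queries do with the set-aside part, which this sketch does not address.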
   
   ### How to reproduce
   
   Steady-state SkyWalking deployment, OAP forwarding traces to standalone 
BanyanDB. We see this on:
   
   - BanyanDB: `apache/skywalking-banyandb:0.10.1`
   - SkyWalking OAP: `apache/skywalking-oap-server:10.4.0`
   - ~30+ Java services, `apache-skywalking-java-agent` 9.5.0, JDK 21
   - Standalone BanyanDB on Kubernetes (Aliyun ACK), 
`--trace-root-path=/data/trace`
   - 51 GB cumulative on disk after 17h of ingest (stream 38.5G + trace 12.9G + 
measure 26M)
   
   The very first occurrence happens within ~30 minutes of starting fresh 
(after fully wiping `/data` and letting OAP recreate schemas). After that, 
panic cadence stabilizes at ~8 minutes.
   
   ### Anything else
   
   This bug is in the new trace storage engine introduced by #713 in 0.10.0; we 
did not see this panic on 0.9.0 (which uses the older trace path).
   
   We have already reported the related — but distinct — timestamp-ordering 
panic in the **write** path as #13860 (recoverable, not crashing the process). 
Filing this one separately because the failure mode (background merge 
goroutine, no recovery, full process exit) is different and arguably more 
disruptive.
   
   Happy to gather more samples (full stack traces over time, sample part dumps 
if a tool exists, sysrq dumps, anything) on request.
   
   ### Are you willing to submit a pull request to fix on your own
   
   - [ ] Yes, I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)

