Felix-wave commented on issue #13861:
URL: https://github.com/apache/skywalking/issues/13861#issuecomment-4394533813
Brilliant root-cause work, @hanahmily — the torn-`spans.bin` reproducer
matches our panic shape character-for-character, and the audit in #13862 lines
up with the symptoms (perpetuates across restarts, hits all four storage
engines that share the same merge writer). The two-phase commit + quarantine
plan looks right.
Here are the diagnostics you asked for. Spoiler: I don't think OOMKill or
eviction is the first-cause tear in our setup, but I have an alternative
hypothesis below worth considering.
### 1. Pod exit history
Both clusters' BanyanDB pods exit cleanly via `panic → process exit`, not
SIGKILL or eviction:
```
KL Prod:
  restartCount:     67 (over ~21h on the fresh disk)
  current state:    waiting (CrashLoopBackOff)
  lastTerminated:   reason=Error exitCode=2 (Go runtime panic, NOT 137/OOMKilled)
  pod podStartTime: 2026-05-06T09:35:47Z (uninterrupted; same pod object the whole run)
SG Dev:
  restartCount:     70 (over ~20h)
  lastTerminated:   reason=Error exitCode=2
  pod podStartTime: 2026-05-06T10:41:07Z (uninterrupted)
```
Both pods are still the original ones from the fresh-disk start — same
`metadata.creationTimestamp`, same UID. Just the container has been restarting
in place.
`kubectl get events` is no help for the *first* panic (Kubernetes expires
events after ~1h by default and we're 21h in), but the recent windows show only
`Unhealthy` (readiness probe vs. a DB that's mid-restart) and `BackOff`: no
`OOMKilling`, no `Evicted`, no `Killing`. Pod-level `restartCount` only counts
in-place container restarts, and on every previous termination the reason has
been `Error` (Go panic), not `OOMKilled`.
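For anyone triaging similar restarts, the exit-code reasoning above can be sketched as a small classifier. This is only an illustration, not part of BanyanDB or Kubernetes: `classifyExit` is a hypothetical helper. It relies on two conventions: the Go runtime exits with status 2 on an unrecovered panic, and a status above 128 means the process died from signal `status - 128` (so 137 is SIGKILL, which is what an OOM kill produces). Exit code 2 can have other causes, so treat the result as a hint rather than proof.

```go
package main

import "fmt"

// classifyExit maps a container exit code to a likely cause.
// Hypothetical helper for illustration; the mapping is a convention, not a guarantee.
func classifyExit(code int) string {
	switch {
	case code == 0:
		return "clean exit"
	case code == 2:
		// The Go runtime exits with status 2 after an unrecovered panic.
		return "consistent with an unrecovered Go panic"
	case code == 137:
		// 128 + 9 (SIGKILL): what the kernel OOM killer or a forced kill produces.
		return "SIGKILL (possible OOM kill)"
	case code > 128:
		return fmt.Sprintf("terminated by signal %d", code-128)
	default:
		return "application error"
	}
}

func main() {
	for _, code := range []int{2, 137, 143} {
		fmt.Printf("exit %d: %s\n", code, classifyExit(code))
	}
}
```

Both clusters report exit code 2 on every observed termination, which falls in the panic branch rather than the signal branches.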
### 2. Memory limits + utilization
```
KL Prod banyandb container: requests cpu=2 memory=16Gi   limits cpu=8 memory=32Gi
SG Dev  banyandb container: requests cpu=1 memory=8Gi    limits cpu=4 memory=16Gi
```
Live `kubectl top` snapshot:
```
SG Dev banyandb:  2464m CPU, 4760Mi RSS (≈ 30% of the 16Gi limit, ≈ 60% of the 8Gi request)
KL Prod banyandb: metrics-server returns "Metrics not available, age: 20h36m"; pod has been
                  restarting too aggressively for metrics-server to settle on a sample.
                  I'll capture this on the next clean restart and update.
```
We don't have Prometheus retention long enough to cover the first 30-60
min after the wipe; if it would be useful, I can wipe one cluster again, run
`kubectl top pod --containers` on a 5s loop for the first hour, and share the trace.
### 3. An alternative hypothesis for the first tear
Given that exit code 2 is the only termination reason we ever see and
there's no OOMKill/eviction signal, I don't think the first torn part comes
from an external kill. My guess:
> **The timestamp panic from #13860 fires inside the merge goroutine, gets
recovered by an inner gRPC recovery interceptor (not the merger's own
goroutine), and leaves the merge mid-flight — torn `spans.bin` with
`metadata.json` not yet written.**
The reasoning:
- The very first `cannot contain timestamp smaller than` panic on a fresh KL
disk fires within a few minutes of OAP starting to ingest, well before any of
the offset panics.
- Cadence on a fresh KL disk: ~7 timestamp panics in the first 31 min, ~31
in the first 2h, all caught by `grpc-middleware/v2` recovery (no process exit).
Per-environment frequency tracks ingest volume (KL ≈ 4 min/panic at high
traffic, SG ≈ 5 min/panic at dev traffic).
- The first offset panic only shows up after the timestamp panic has fired
many times. After that it self-perpetuates exactly as #13862 describes.
If the timestamp panic ever propagates from a write goroutine into a
merge-write code path (or shares any I/O state with the merge writer's
`seqWriter`s), the torn-write window in the audit is open. This would also
explain why the latency to the *first* offset panic correlates loosely with
traffic: more timestamp panics → more chances to land one mid-merge.
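To make the suspected mechanism concrete, here is a minimal Go sketch of how a recovered panic can leave a part half-written. It is an assumption-laden toy, not BanyanDB code: `writePart`, `safely`, and `demo` are hypothetical names, the two `os.WriteFile` calls stand in for the real merge writer, and `safely` stands in for a recovery middleware. It only shows that when a panic between the data write and the metadata write is recovered, the process keeps running with `spans.bin` on disk and no `metadata.json`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePart mimics a part write: data file first, metadata last.
// validate runs between the two writes and may panic (stand-in for a timestamp check).
func writePart(dir string, validate func()) {
	if err := os.WriteFile(filepath.Join(dir, "spans.bin"), []byte("partial data"), 0o644); err != nil {
		panic(err)
	}
	validate()
	if err := os.WriteFile(filepath.Join(dir, "metadata.json"), []byte("{}"), 0o644); err != nil {
		panic(err)
	}
}

// safely runs fn and converts a panic into an error, as a recovery middleware would.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

// exists reports whether path can be stat'ed.
func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// demo writes a part whose validation panics and reports what is left on disk.
func demo() (hasData, hasMeta bool, recovered error) {
	dir, err := os.MkdirTemp("", "part-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	recovered = safely(func() {
		writePart(dir, func() { panic("simulated timestamp check failure") })
	})
	hasData = exists(filepath.Join(dir, "spans.bin"))
	hasMeta = exists(filepath.Join(dir, "metadata.json"))
	return hasData, hasMeta, recovered
}

func main() {
	hasData, hasMeta, recovered := demo()
	fmt.Println("spans.bin present:", hasData)
	fmt.Println("metadata.json present:", hasMeta)
	fmt.Println("recovered error:", recovered)
}
```

Note that `recover` only catches panics raised on the same goroutine as the deferred call, so this sketch applies only if the panic and the recovery share a goroutine with the write path. Whether that is the case in BanyanDB is exactly the open question.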
I haven't proven this — it's a hypothesis from observing the panic ordering.
Two ways to discriminate:
a. **Add a panic-counter and timestamp to the merge goroutine.** If
timestamp panics are observed inside `mergeParts` / `mustWriteBlock` (i.e., the
merge writer's own code path, not the gRPC write handler), this hypothesis is
confirmed.
b. **Land #13860's check as a recoverable error in the merge writer too**,
not just a `Panicf`. Independent of #13862, that closes a class of "merge
goroutine raised a panic from inside a `MustWriteBlock` call and damaged its
own output."
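A rough Go sketch of both discriminators, under stated assumptions: `observePanics`, `mergePanics`, `lastPanicUnixNano`, and `checkTimestamp` are hypothetical names, not existing BanyanDB symbols, and the ordering rule in `checkTimestamp` is only my reading of the `cannot contain timestamp smaller than` message. The wrapper counts and timestamps panics and then re-panics, so it adds visibility without changing behavior. The check shows what an error-returning variant could look like.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Counters for panics observed inside the wrapped code path (hypothetical names).
var (
	mergePanics       atomic.Int64
	lastPanicUnixNano atomic.Int64
)

// observePanics runs fn, records any panic, then re-panics so behavior is unchanged.
// Wrapping the merge entry point this way would show whether panics occur there (option a).
func observePanics(fn func()) {
	defer func() {
		if r := recover(); r != nil {
			mergePanics.Add(1)
			lastPanicUnixNano.Store(time.Now().UnixNano())
			panic(r) // propagate: this wrapper only observes
		}
	}()
	fn()
}

// checkTimestamp is an error-returning variant of an ordering check (option b).
// Assumed rule: next must not be smaller than prev.
func checkTimestamp(prev, next int64) error {
	if next < prev {
		return fmt.Errorf("timestamp %d is smaller than previous %d", next, prev)
	}
	return nil
}

func main() {
	func() {
		// Swallow the re-panic here for the demo only.
		defer func() { _ = recover() }()
		observePanics(func() { panic("simulated merge panic") })
	}()
	fmt.Println("merge panics observed:", mergePanics.Load())
	fmt.Println("ordering check:", checkTimestamp(10, 5))
}
```

With an error return, the merge writer could abort and clean up its partial output before giving up, instead of unwinding through it.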
### 4. Help we can offer going forward
- I can run an instrumented build on KL Prod whenever you have one ready
(e.g., one that logs `(traceID, offsets, bytesRead, file path)` at every
`seqWriter` open/close and at every panic site). Iteration time is ~25 min
between panics on a fresh disk in KL.
- I can run the standalone reproducer test you described
(`Test_merger_tornWriteRepro` in `banyand/trace/merger_repro_test.go`) locally
to confirm we reproduce the exact same panic shape — please link the branch
when convenient.
- I can take controlled clean wipes on either KL or SG to characterize the
first-tear timing as needed.
In the meantime, given that the issue is now well-understood and a fix is in
flight, we're going to roll back to BanyanDB 0.9.0 + OAP 10.3.0 to restore
monitoring availability. We'll be ready to pick up a 0.10.x release that
includes #13862 whenever it's available, and to verify the fix on the same
workload that exposed it.