Felix-wave commented on issue #13861:
URL: https://github.com/apache/skywalking/issues/13861#issuecomment-4394533813

   Brilliant root-cause work, @hanahmily — the torn-`spans.bin` reproducer 
matches our panic shape character-for-character, and the audit in #13862 lines 
up with the symptoms (perpetuates across restarts, hits all four storage 
engines that share the same merge writer). The two-phase commit + quarantine 
plan looks right.
   
   Here are the diagnostics you asked for. Spoiler: I don't think OOMKill or 
eviction is the first-cause tear in our setup, but I have an alternative 
hypothesis below worth considering.
   
   ### 1. Pod exit history
   
   In both clusters the BanyanDB container terminates via `panic → process 
exit`, not SIGKILL or eviction:
   
   ```
   KL Prod:
     restartCount: 67  (over ~21h on the fresh disk)
     current state: waiting (CrashLoopBackOff)
     lastTerminated: reason=Error  exitCode=2  (Go runtime panic, NOT 137/OOMKilled)
     pod startTime: 2026-05-06T09:35:47Z (uninterrupted; same pod object the whole run)
   
   SG Dev:
     restartCount: 70  (over ~20h)
     lastTerminated: reason=Error  exitCode=2
     pod startTime: 2026-05-06T10:41:07Z (uninterrupted)
   ```
   
   Both pods are still the original ones from the fresh-disk start — same 
`metadata.creationTimestamp`, same UID. Just the container has been restarting 
in place.
   
   `kubectl get events` is no help for the *first* panic (Kubernetes expires 
events after ~1h by default and we're 21h in), but the recent windows show only 
`Unhealthy` (readiness probe vs. a DB that's mid-restart) and `BackOff`: no 
`OOMKilling`, no `Evicted`, no `Killing`. `restartCount` only counts in-place 
container restarts, and on every previous termination the reason has been 
`Error` (Go panic), not `OOMKilled`.
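
   For reference, this is how I'm reading the exit codes. It is a minimal Go 
sketch, not anything from BanyanDB or Kubernetes; `classifyExit` is a 
hypothetical helper. It relies on two conventions: the Go runtime exits with 
status 2 after an unrecovered panic, and codes above 128 mean 128 + signal 
number (so 137 is SIGKILL). Exit code 2 alone is suggestive rather than proof, 
since a program can also call `os.Exit(2)` itself.

```go
package main

import "fmt"

// classifyExit maps a container exit code to a likely termination cause.
// Hypothetical helper for illustration only.
func classifyExit(code int) string {
	switch {
	case code == 0:
		return "clean exit"
	case code == 2:
		// The Go runtime exits with status 2 after an unrecovered panic.
		return "likely unrecovered Go panic"
	case code == 137:
		// 128 + 9 (SIGKILL): what an OOM kill or a forced kill produces.
		return "SIGKILL (OOM kill or forced termination)"
	case code > 128:
		// Shell convention: 128 + signal number.
		return fmt.Sprintf("terminated by signal %d", code-128)
	default:
		return "application error"
	}
}

func main() {
	for _, c := range []int{2, 137, 143} {
		fmt.Printf("exitCode=%d: %s\n", c, classifyExit(c))
	}
}
```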
   
   ### 2. Memory limits + utilization
   
   ```
   KL Prod  banyandb container:  requests cpu=2 memory=16Gi   limits cpu=8 memory=32Gi
   SG Dev   banyandb container:  requests cpu=1 memory=8Gi    limits cpu=4 memory=16Gi
   ```
   
   Live `kubectl top` snapshot:
   
   ```
   SG Dev   banyandb:   2464m CPU, 4760Mi RSS  (≈ 30% of the 16Gi limit, ≈ 60% of the 8Gi request)
   KL Prod  banyandb:   metrics-server returns "Metrics not available, age: 20h36m"; pod has been
                        restarting too aggressively for metrics-server to settle on a sample.
                        I'll capture this on the next clean restart and update.
   ```
   
   Our Prometheus retention doesn't reach back to the first 30-60 min after 
the wipe; if it would be useful, I can wipe one cluster again and run 
`kubectl top pod --containers` on a 5s loop for the first hour and share the 
trace.
   
   ### 3. An alternative hypothesis for the first tear
   
   Given that exit code 2 is the only termination reason we ever see and 
there's no OOMKill/eviction signal, I don't think the first torn part comes 
from an external kill. My guess:
   
   > **The timestamp panic from #13860 fires inside the merge goroutine, gets 
recovered by an inner gRPC recovery interceptor (not the merger's own 
goroutine), and leaves the merge mid-flight — torn `spans.bin` with 
`metadata.json` not yet written.**
   
   The reasoning:
   
   - The very first `cannot contain timestamp smaller than` panic on a fresh KL 
disk fires within a few minutes of OAP starting to ingest, well before any of 
the offset panics.
   - Cadence on a fresh KL disk: ~7 timestamp panics in the first 31 min, ~31 
in the first 2h, all caught by `grpc-middleware/v2` recovery (no process exit). 
Per-environment frequency tracks ingest volume (KL ≈ 4 min/panic at high 
traffic, SG ≈ 5 min/panic at dev traffic).
   - The first offset panic only shows up after the timestamp panic has fired 
many times. After that it self-perpetuates exactly as #13862 describes.
   
   If the timestamp panic ever propagates from a write goroutine into a 
merge-write code path (or shares any I/O state with the merge writer's 
`seqWriter`s), the torn-write window in the audit is open. This would also 
explain why the latency to the *first* offset panic correlates loosely with 
traffic: more timestamp panics → more chances to land one mid-merge.
   
   I haven't proven this — it's a hypothesis from observing the panic ordering. 
Two ways to discriminate:
   
   a. **Add a panic-counter and timestamp to the merge goroutine.** If 
timestamp panics are observed inside `mergeParts` / `mustWriteBlock` (i.e., the 
merge writer's own code path, not the gRPC write handler), this hypothesis is 
confirmed.
   b. **Land #13860's check as a recoverable error in the merge writer too**, 
not just a `Panicf`. Independent of #13862, that closes a class of "merge 
goroutine raised a panic from inside a `MustWriteBlock` call and damaged its 
own output."
   
   ### 4. Help we can offer going forward
   
   - I can run an instrumented build on KL Prod whenever you have one ready 
(e.g., one that logs `(traceID, offsets, bytesRead, file path)` at every 
`seqWriter` open/close and at every panic site). Iteration time is ~25 min 
between panics on a fresh disk in KL.
   - I can run the standalone reproducer test you described 
(`Test_merger_tornWriteRepro` in `banyand/trace/merger_repro_test.go`) locally 
to confirm we reproduce the exact same panic shape — please link the branch 
when convenient.
   - I can take controlled clean wipes on either KL or SG to characterize the 
first-tear timing as needed.
   
   In the meantime, given that the issue is now well-understood and a fix is in 
flight, we're going to roll back to BanyanDB 0.9.0 + OAP 10.3.0 to restore 
monitoring availability. We'll be ready to pick up a 0.10.x release that 
includes #13862 whenever it's available, and to verify the fix on the same 
workload that exposed it.

