Felix-wave commented on issue #13861:
URL: https://github.com/apache/skywalking/issues/13861#issuecomment-4404032233

   Quick update with strong evidence that **PR #1114 (the trace-write timestamp 
panic fix) was the missing piece** for stability on our workload.
   
   ## Setup
   
   I built a custom image that takes `v0.10.x` (which is already at v0.10.2) 
and cherry-picks the `e810640` patch (PR #1114). Then I wiped both `/data` 
directories completely (as recommended by your torn-write analysis) and brought 
up KL Prod on this image.
   
   ```
   banyandb image: skywalking-banyandb v0.10.x + cherry-pick e810640  (PR #1114)
   OAP image:      apache/skywalking-oap-server:10.4.0  (unchanged)
   disk:           wiped clean before start
   ```
   
   ## Result: 12 hours, zero panics, zero restarts
   
   ```
   KL Prod  banyandb (12h after fresh start):
     pod restartCount  = 0
     #13860 timestamp panic                  = 0
     #13861 offset/bytesRead panic           = 0
     total panic events                      = 0
     errors                                  = 0  (excluding the 'watcher channel full' info noise)
     data ingested                           = stream 46.6 GB / trace 98.1 GB / measure 55.5 MB
   ```
   
   For comparison on the same cluster / workload over similar windows:
   
   | Build | Run length | Pod restarts | #13860 timestamp panic | #13861 offset panic |
   | --- | --- | --- | --- | --- |
   | 0.9.0 + OAP 10.3.0 | 17 h | 17 h / ~28 min ≈ 36 (we measured ~406 over 7 days) | many (process exit) | n/a (different storage engine) |
   | 0.10.1 + OAP 10.4.0 (fresh disk) | 17 h | 41 | many | 41 (process exit) |
   | 0.10.2 + OAP 10.4.0 (fresh disk) | 80 min before SG hit a background-goroutine timestamp panic that wasn't caught by gRPC recovery | 1 | several | 0 (good: race fix in 0.10.2 worked) |
   | **v0.10.x + e810640 (this build) + fresh disk** | **12 h and counting** | **0** | **0** | **0** |
   
   So in our environment **PR #1114 was the load-bearing fix**. On a clean 
disk, 0.10.2's race-shutdown fix alone was enough to suppress #13861 (the 
merger-loop panic), but **#13860 was still firing from background goroutines 
(introducer/persister/merger of the bluge index, measure flush/merge), where 
there is no gRPC recovery interceptor to catch it**. A single timestamp panic 
in the wrong goroutine therefore still took the process down, which is exactly 
what we saw on SG.
   
   With #1114 landed, those background paths stop generating the panic in the 
first place, and the cluster runs cleanly.
   
   ## Operational confirmation
   
   Functional probe via the OAP GraphQL endpoint while running on this image:
   
   - service registry visible: 23 services including 
`auto-creator-backend-prod`, `data-api-hub-*`, `spider-manager-service-*`
   - `service_resp_time` (avg) returns sensible numbers (~36-68 ms range, in 
line with what we observed on previous storage backends)
   - `service_cpm` returns expected per-minute call counts (111-302 / min)
   
   ## Cherry-pick to v0.10.x?
   
   Given the impact, would it be worth cherry-picking #1114 onto v0.10.x and 
cutting a 0.10.3? We're happy to keep running our own build until upstream 
catches up, but I expect anyone else hitting this on a multi-agent SkyWalking 
deployment will see the same instability — the panic was firing every ~4 
minutes for us before the fix.
   
   Thanks again to you and @wu-sheng for the fast turnaround on this — really 
appreciate the in-depth root-cause work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
