Felix-wave commented on issue #13861:
URL: https://github.com/apache/skywalking/issues/13861#issuecomment-4393897787
Thanks for the quick reply, @hanahmily. To verify the version-mismatch
hypothesis, we ran a controlled experiment with a fully clean state on both of
our environments (KL Prod and SG Dev):
1. Scaled OAP and BanyanDB to 0.
2. **Deleted the PVC entirely**, which (with `reclaimPolicy: Delete` and
Aliyun ESSD CSI) cascaded to PV deletion and physical disk deletion. Confirmed
both Aliyun disks were gone:
- KL: `d-8ps1hxrabeggd4foci5p` → deleted
- SG: `d-t4niiqzwz9f8wke16x97` → deleted
3. Created a fresh PVC of the same size. Storage class `alicloud-disk-essd`
provisioned **brand-new disks** (`d-8ps458950s8goggjqiso` for KL,
`d-t4ngw4m8qhuusgrtvnod` for SG).
4. Scaled BanyanDB 0.10.1 + OAP 10.4.0 back up. Verified `/data` was empty
on first boot and OAP recreated all groups + schemas from scratch.
So at this point there is **no possibility of 0.9.x file residue** — the
disk itself is new, never touched by 0.9.
### Initial observation (looked promising)
For roughly the first 2 hours, the merger panic was absent on both clusters:
```
KL Prod (fresh disk, 2h6m): trace 25.5G / stream 33G #13861 = 0
SG Dev (fresh disk, 1h): trace 166M / stream 5.8G #13861 = 0
```
Notably, KL had already accumulated **25.5 GB of trace data — about 2× the
12.9 GB it had on the old disk when this panic was occurring every ~8 minutes**
— yet zero `offset must be equal to bytesRead` panics during this window.
### After more time, the panic returned
I let both clusters run overnight. ~17 hours after starting fresh:
```
KL Prod (fresh disk, 17h):
banyandb restart=41
most recent panic: "offset 0 must be equal to bytesRead 63954"
pod state: CrashLoopBackOff
disk used: 528 GB
SG Dev (fresh disk, 16h):
banyandb restart=70
most recent panic: "offset 5263206 must be equal to bytesRead 5262950"
disk used: 62 GB
```
So the panic does eventually fire on a brand-new disk; it just takes longer
than ~2 hours to reach whatever condition triggers it. Restart cadence on the
fresh disk in steady state for KL is roughly **once every 25 minutes** (vs ~8
min on the old disk — possibly a function of accumulated state, but still not
stable).
### What this rules out
The "0.9.x file residue + missing version check" hypothesis does not explain
the failure: the failure reproduces on disks that were never written to by
0.9.x.
### What it suggests
The panic appears tied to **runtime state that 0.10.1 itself produces**
(merger output? compaction? specific tag layouts?), not to legacy file format.
Happy to dig further — e.g., capture a part dump just before the next panic if
there's a tool for it, or attach the full panic context with a few hundred
surrounding log lines.
Should I file this as a separate finding, or treat it as a continuation of
this issue? Also, is the version-check process itself worth a separate ticket,
since you mentioned it might not be functioning correctly?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]