Felix-wave commented on issue #13861:
URL: https://github.com/apache/skywalking/issues/13861#issuecomment-4393897787

   Thanks for the quick reply, @hanahmily. To verify the version-mismatch 
hypothesis, we ran a controlled experiment with a fully clean state on both of 
our environments (KL Prod and SG Dev):
   
   1. Scaled OAP and BanyanDB to 0.
   2. **Deleted the PVC entirely**, which (with `reclaimPolicy: Delete` and 
Aliyun ESSD CSI) cascaded to PV deletion and physical disk deletion. Confirmed 
both Aliyun disks were gone:
      - KL: `d-8ps1hxrabeggd4foci5p` → deleted
      - SG: `d-t4niiqzwz9f8wke16x97` → deleted
   3. Created a fresh PVC of the same size. Storage class `alicloud-disk-essd` 
provisioned **brand-new disks** (`d-8ps458950s8goggjqiso` for KL, 
`d-t4ngw4m8qhuusgrtvnod` for SG).
   4. Scaled BanyanDB 0.10.1 + OAP 10.4.0 back up. Verified `/data` was empty 
on first boot and OAP recreated all groups + schemas from scratch.
   
   So at this point there is **no possibility of 0.9.x file residue**: the 
disks themselves are new and were never touched by 0.9.x.
   
   ### Initial observation (looked promising)
   
   For roughly the first 2 hours, the merger panic was absent on both clusters:
   
   ```
   KL Prod  (fresh disk, 2h6m):  trace 25.5G / stream 33G   #13861 = 0
   SG Dev   (fresh disk, 1h):    trace 166M  / stream 5.8G  #13861 = 0
   ```
   
   Notably, KL had already accumulated **25.5 GB of trace data — about 2× the 
12.9 GB it had on the old disk when this panic was occurring every ~8 minutes** 
— yet zero `offset must be equal to bytesRead` panics during this window.
   
   ### After more time, the panic returned
   
   I let both clusters run overnight. ~17 hours after starting fresh:
   
   ```
   KL Prod  (fresh disk, 17h):
     banyandb restart=41
     most recent panic:  "offset 0 must be equal to bytesRead 63954"
     pod state: CrashLoopBackOff
     disk used: 528 GB
   
   SG Dev   (fresh disk, 16h):
     banyandb restart=70
     most recent panic:  "offset 5263206 must be equal to bytesRead 5262950"
     disk used: 62 GB
   ```
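
   The two panic messages can be compared more easily by extracting the numbers. Below is a minimal sketch, assuming the message always has the form quoted above; the regular expression and the helper name `parse_offset_mismatch` are illustrative and not part of BanyanDB.

```python
import re

# Pattern assumed from the two panic messages quoted in this comment;
# other BanyanDB versions may word the message differently.
PANIC_RE = re.compile(r"offset (\d+) must be equal to bytesRead (\d+)")


def parse_offset_mismatch(line):
    """Return (offset, bytes_read, offset - bytes_read), or None if no match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    offset, bytes_read = int(m.group(1)), int(m.group(2))
    return offset, bytes_read, offset - bytes_read


samples = [
    "offset 0 must be equal to bytesRead 63954",
    "offset 5263206 must be equal to bytesRead 5262950",
]
for s in samples:
    print(parse_offset_mismatch(s))
```

   For these two samples the gap is -63954 on KL and +256 on SG, so the mismatch has a different sign and magnitude on the two clusters.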
   
   So the panic does eventually fire on a brand-new disk; it just takes longer 
than ~2 hours to reach whatever condition triggers it. In steady state on the 
fresh disk, KL restarts roughly **once every 25 minutes**, versus about every 
8 minutes on the old disk. The difference may be a function of accumulated 
state, but either way the cluster is not stable.
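
   As a rough cross-check of the cadence, the restart counts and uptimes reported above can be turned into an average interval. This is only a sketch: it assumes every restart was caused by this panic and that restarts were evenly spread. The `quiet_hours` parameter is a hypothetical knob for excluding the initial panic-free window, whose exact length is not known.

```python
def mean_restart_interval_min(uptime_hours, restarts, quiet_hours=0.0):
    """Average minutes between restarts, optionally ignoring an initial quiet window."""
    if restarts <= 0:
        raise ValueError("restarts must be positive")
    return (uptime_hours - quiet_hours) * 60.0 / restarts


# Figures taken from the overnight observation above.
kl = mean_restart_interval_min(17, 41)
sg = mean_restart_interval_min(16, 70)
# Assumption: KL was panic-free for about the first 2 hours.
kl_active = mean_restart_interval_min(17, 41, quiet_hours=2)

print(f"KL: {kl:.1f} min, SG: {sg:.1f} min, KL excluding first 2h: {kl_active:.1f} min")
```

   Over the whole run this gives about 24.9 minutes for KL and 13.7 minutes for SG. Excluding the first two hours brings KL down to about 22 minutes.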
   
   ### What this rules out
   
   The "0.9.x file residue + missing version check" hypothesis does not explain 
the failure, which reproduces on disks that were never written to by 0.9.x.
   
   ### What it suggests
   
   The panic appears tied to **runtime state that 0.10.1 itself produces** 
(merger output? compaction? specific tag layouts?), not to the legacy file 
format. Happy to dig further, e.g., by capturing a part dump just before the 
next panic if there's a tool for it, or by attaching the full panic context 
with a few hundred surrounding log lines.
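
   One way to collect the surrounding log lines is a small filter that keeps a rolling buffer of recent lines and emits a window around each match. The sketch below is a generic helper, not a BanyanDB tool; the function name, the default `needle`, and the window sizes are assumptions.

```python
from collections import deque


def panic_context(lines, needle="must be equal to bytesRead", before=200, after=200):
    """Yield one list per matching line: up to `before` lines, the match, up to `after` lines."""
    history = deque(maxlen=before)  # rolling buffer of the most recent lines
    open_windows = []               # each entry: [remaining_after, collected_lines]
    for line in lines:
        # Feed the current line to windows still collecting trailing context.
        for w in open_windows:
            w[1].append(line)
            w[0] -= 1
        finished = [w for w in open_windows if w[0] == 0]
        open_windows = [w for w in open_windows if w[0] > 0]
        for w in finished:
            yield w[1]
        if needle in line:
            window = [after, list(history) + [line]]
            if after == 0:
                yield window[1]
            else:
                open_windows.append(window)
        history.append(line)
    # The log ended before some windows were full; emit what was collected.
    for w in open_windows:
        yield w[1]


demo = ["a", "b", "c", "panic: offset 0 must be equal to bytesRead 63954", "d", "e"]
for window in panic_context(demo, before=2, after=1):
    print(window)
```

   The same function can be fed any iterable of lines, for example a saved log file from the previous container instance.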
   
   Should I file this as a separate finding, or treat it as a continuation of 
this issue? Also, is the version-check process itself worth a separate ticket, 
since you mentioned it might not be functioning correctly?

