Felix-wave commented on issue #13860: URL: https://github.com/apache/skywalking/issues/13860#issuecomment-4393898679
Adding a second data point from a controlled experiment. We rebuilt both of our SkyWalking deployments **completely from scratch** (PVC deleted, Aliyun ESSD disk released, fresh disk auto-provisioned, BanyanDB 0.10.1 + OAP 10.4.0 starting on an empty `/data`). Details of the experiment are in #13861.

The timestamp panic (this issue) **fires immediately** on a brand-new disk, with no accumulated history at all:

| Environment | Time elapsed since fresh start | `cannot contain timestamp smaller than` panics |
| --- | --- | --- |
| KL Prod (high traffic, ~30 services) | 31 min | 7 (≈ 1 every 4.4 min) |
| KL Prod | 2 h 6 min | 31 (≈ 1 every 4 min) |
| SG Dev (low traffic, dev workload) | 61 min | 12 (≈ 1 every 5 min) |

Cadence scales with trace ingestion volume, which is consistent with the original theory in this issue: cross-service segment timestamps within a trace are not strictly monotonic, and `block_writer.go` treats that as a panic-able invariant violation.

The recovery interceptor still catches each occurrence, so this isn't crashing the process by itself. But on a busy cluster the log volume from the recovered stack traces is non-trivial, and combined with the unrelated merger crash in #13861 it makes diagnosing other issues harder.

Let me know if there's anything specific I can capture (e.g., `tid` distribution across the panics, sample raw segments, agent versions per offending service) that would help.

Our agent versions: `apache-skywalking-java-agent` 9.5.0 across all services.
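
For anyone who wants to re-derive the per-panic intervals from the table above, here is a minimal Python sketch. The `Observation` class and the `first_out_of_order` helper are hypothetical names introduced only for illustration. The helper assumes the invariant is "no timestamp smaller than its predecessor", inferred from the panic message; it is not taken from the actual `block_writer.go` code, and the sample timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One row of the table: environment, elapsed minutes, panic count."""
    environment: str
    elapsed_minutes: int
    panics: int

    @property
    def minutes_per_panic(self) -> float:
        # Average interval between panics over the observation window.
        return self.elapsed_minutes / self.panics


# Figures copied from the table in this comment.
OBSERVATIONS = [
    Observation("KL Prod", 31, 7),
    Observation("KL Prod", 2 * 60 + 6, 31),
    Observation("SG Dev", 61, 12),
]


def first_out_of_order(timestamps):
    """Return the index of the first timestamp smaller than its predecessor.

    Assumption: this mirrors the wording of the panic message only; the real
    check in BanyanDB may compare against something else.
    Returns None when the sequence is non-decreasing.
    """
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i
    return None


if __name__ == "__main__":
    for obs in OBSERVATIONS:
        print(f"{obs.environment}: {obs.minutes_per_panic:.1f} min per panic")

    # Hypothetical segment timestamps in milliseconds, not real captured data.
    sample = [1000, 1005, 1003, 1010]
    print("first out-of-order index:", first_out_of_order(sample))
```

Running the same helper over real sample segments grouped by `tid` would show whether the out-of-order entries cluster around particular services.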
