This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch vectorized-query
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git

commit 48c0f76be4d19e2999e4591065291ebdd9302ed3
Author: Hongtao Gao <[email protected]>
AuthorDate: Wed May 13 03:20:50 2026 +0000

    docs(soak): G5d Phase D summary
    
    48 h vec-on soak completed at run dist/soak/20260511T005933/. Three of
    four acceptance criteria pass cleanly:
      1. 48 h vec-on run                 PASS (containers healthy, no
         intervention required)
      2. Parity vs flag-off baseline     PASS (576 replay-and-diff cycles,
         0 divergences; deterministic 986-data-point fixture)
      3. No MemoryTracker exhaustion     PASS (memory-alerts.log empty;
         BanyanDB held flat at ~200 MiB / 2 GiB cap)
    
    Criterion 4 (goroutine drift <=5%) tripped at +27% (556 -> 708) but
    root-caused to the bluge index writer pool not releasing analysisWorker
    goroutines on segment rotation -- a storage-layer behavior the
    vectorized query path does not touch. Filed upstream as
    apache/skywalking#13874 under the database label and BanyanDB - 0.11.0
    milestone.
    
    Recommendation: proceed with G5e default flip (draft at
    .omc/g5e-flip-draft.md). The criterion-4 miss would reproduce under
    vec-off on the row path with the same SegmentInterval=1 day
    configuration and is independent of the flag flip.
---
 docs/soak/g5d-phase-d-summary.md | 94 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/docs/soak/g5d-phase-d-summary.md b/docs/soak/g5d-phase-d-summary.md
new file mode 100644
index 000000000..b0f68ce04
--- /dev/null
+++ b/docs/soak/g5d-phase-d-summary.md
@@ -0,0 +1,94 @@
+# G5d Phase D — Soak Summary
+
+**Run:** `dist/soak/20260511T005933/`
+**Branch:** `vectorized-query` @ commit `90aacda9`
+**Phase 0 (vec-off baseline):** 2026-05-11 00:59 Z → 02:03 Z (~1 h 4 min)
+**Phase 1 (vec-on soak):** 2026-05-11 02:03 Z → 2026-05-13 02:03 Z (48 h 0 min)
+**Host:** 32-core Linux, 31 GB RAM, no swap; Docker daemon 29.3.1
+**Harness:** `scripts/soak-vectorized.sh` + `scripts/soak-monitor.sh` + `cmd/soak-driver`
+
+## Acceptance criteria
+
+### 1. ≥48 h staging run with `--measure-vectorized-enabled=true` — **PASS**
+
+Phase 1 ran 48 hours wall-clock with the flag enabled in `banyand/measure/docker-compose.soak.yaml`. The soak script (`pid 4013611`) and monitor (`pid 4015238`) ran to completion without intervention; final teardown was observed in `dist/soak/20260511T005933/run.log` at 2026-05-13 02:03 Z (`Network soak_demo Removed`). All five compose services (banyandb, oap, provider, consumer, traffic_loader) remained healthy for the entire window per `docker compose ps` and `monitor.log`.
+
+### 2. Parity vs flag-off baseline = no regressions — **PASS**
+
+Phase 0 captured a deterministic baseline of **986 data points** (98.6 % of the 1000-row `soak/soak_metric` fixture seeded by `cmd/soak-driver seed-fixture`; the 14-row trim is the by-design `[T0, T1] ± 1 s` window edge). Phase 1 replayed the same query catalog every 5 minutes for 48 hours against the restored snapshot, comparing each response to the baseline for exact message equality via `proto.Equal` over the `[]*measurev1.DataPoint` elements.
+
+- **576 replay-and-diff runs** (575 in-loop reports plus `diff-final.json`)
+- **0 divergences** (every `diff-*.json` has `"pass": true`)
+- The final parity check (`diff-final.json`, run during teardown) likewise passed
+
+Evidence: `ls dist/soak/20260511T005933/diff-*.json | wc -l` → 576; `grep -l '"pass": *false' dist/soak/20260511T005933/diff-*.json` → no output.
+
+### 3. No `MemoryTracker` budget exhaustion — **PASS**
+
+The harness continuously tailed `docker compose logs -f banyandb` and `grep`-piped any line matching `MemoryTracker|budget exhausted|memory budget` into `memory-alerts.log`. After 48 hours of sustained query + write traffic the file is **empty** (`wc -l < dist/soak/20260511T005933/memory-alerts.log` → 0). The `pkg/query/vectorized.MemoryTracker` introduced for G1 never approached the per-query 256 MiB budget under realistic OAP-driven load.
+
+Container memory was also flat throughout: BanyanDB ~200 MiB / 2 GiB cap (10 %), OAP ~600 MiB / 2 GiB cap (30 %), Java demos within their 1 GiB caps. No OOM kills against the no-swap host.
+
+### 4. No goroutine leaks (heap profile delta ≤ 5 %) — **PARTIAL / UPSTREAM**
+
+97 pprof captures at 30-minute intervals; goroutine count parsed from the `goroutine profile: total N` header.
+
+| Phase | Goroutines |
+|---|---|
+| `pprof-start` (Phase 1 t0) | 556 |
+| Steady state, hour 1 → hour 21 | 556 (±1) |
+| `pprof-end` (Phase 1 t+48h) | 708 |
+
+**Δ = +152, +27 %** — exceeds the 5 % threshold.
+
+**Root cause is not the vectorized query path.** The growth has a clean signature: two step-function events spaced exactly 24 hours apart, each adding ~76 goroutines, with zero growth between events. Stack-diff between `pprof-start/goroutine-*.txt` and `pprof-end/goroutine-*.txt`:
+
+- **+108 in `github.com/blugelabs/bluge/index.analysisWorker`** (`OpenWriter.func1`, `writer.go:77` → `writer.go:667`)
+- **~+44 orchestration goroutines** around new bluge writers (`pkg/flow.Transmit`, channel waiters)
+- Every other stack signature has an **identical count** at start and end: `pkg/flow.Transmit` 108→108, `grpc/internal/grpcsync.CallbackSerializer.run` 54→54
+
+The 108 breaks down as 2 segment-rotation events × ~54 analysisWorker goroutines per new bluge writer (the pool sizes itself from GOMAXPROCS = 32 on this host). With `SegmentInterval: 1 day`, each UTC midnight crossing rotates the tsTable to a new segment, opening a fresh bluge index writer whose analysis-worker pool is not released when the previous segment goes idle.
+
+The vectorized query path does not touch bluge writers, so the same growth would appear under vec-off on a row-path-only build. Filed upstream as **[apache/skywalking#13874](https://github.com/apache/skywalking/issues/13874)** (label: `database`, milestone: `BanyanDB - 0.11.0`).
+
+## Verdict
+
+| # | Criterion | Result |
+|---|---|---|
+| 1 | 48 h vec-on run | ✓ |
+| 2 | Parity vs flag-off | ✓ |
+| 3 | No MemoryTracker exhaustion | ✓ |
+| 4 | Goroutine drift ≤ 5 % | ✗ (root cause attributed to bluge writer lifecycle, apache/skywalking#13874; pre-existing storage-layer behavior independent of the vec path) |
+
+**Recommendation: proceed with G5e (default flip).** The criterion-4 miss does not block the rollout:
+
+- It is caused by code paths the vectorized query layer does not touch (segment-rotation bluge writer creation in the storage layer).
+- It would reproduce under vec-off on the row path with the same configuration.
+- The growth pattern is bounded by segment count (not query rate or time), so it does not interact with the flag flip in any way that worsens production behavior post-flip.
+- Three of four criteria, including the parity check that the G5b/G5c architectural path was specifically built to satisfy, passed cleanly.
+
+The bluge writer lifecycle fix is tracked at apache/skywalking#13874 and should be picked up under the 0.11.0 milestone independent of G5.
+
+## Next steps
+
+1. **G5e default flip**: pre-drafted at `.omc/g5e-flip-draft.md` as a one-line change in `pkg/query/vectorized/measure/config.go` (`Enabled: false` → `true`) plus a CHANGES.md entry. Verification command list and commit message template included.
+2. **G6 operator wiring**: a distinct multi-commit arc (BatchLimit / BatchGroupBy / BatchAggregation / BatchTop into `NewMIterator`); recommended in a fresh branch.
+3. **apache/skywalking#13874**: bluge writer pool lifecycle fix; not on the v1 rollout critical path.
+
+## Artifact paths (local, gitignored)
+
+```
+dist/soak/20260511T005933/
+├── data-snapshot/              # 17 MB Phase 0 snapshot used for parity replay
+├── baseline.json               # 986 data points
+├── pprof-start/                # heap.pb.gz + goroutine.txt at Phase 1 t0
+├── pprof-<ts>/                 # 95 intermediate captures
+├── pprof-end/                  # final capture, 708 goroutines
+├── diff-<ts>.json              # 575 inner-loop parity reports
+├── diff-final.json             # teardown parity check
+├── banyand.log                 # full stack-trace log
+├── memory-alerts.log           # 0 lines
+├── monitor.log                 # tapered status timeline
+├── run.log                     # orchestrator script output
+└── summary.json                # final acceptance fields
+```
