(skywalking-banyandb) 20/50: docs(test): add CLAUDE.md with G5d soak-harness lessons

hanahmily Wed, 13 May 2026 23:12:44 -0700

This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch vectorized-query
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git


commit 141aee43a14add0bf29573e98169f7202ecdec32
Author: Hongtao Gao <[email protected]>
AuthorDate: Sun May 10 09:20:04 2026 +0000

    docs(test): add CLAUDE.md with G5d soak-harness lessons
    
    Distill the eleven-iteration debugging arc from commit abedc3f0 into
    reusable guidance for any future test harness in this repo. Five
    themes: diagnostic visibility, container hygiene on a no-swap host,
    BanyanDB-specific timing/config semantics, fixture strategy, and
    iteration discipline.
    
    Targets the AI-assistant audience that loads CLAUDE.md as context when
    working under test/ — captures the failure modes (flush timing, query
    limit defaults, root path defaults, ms-aligned timestamps, FieldSpec
    encoding requirements, container UID issues, expose vs ports) so the
    next harness setup doesn't burn iterations rediscovering them.
---
 test/CLAUDE.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/test/CLAUDE.md b/test/CLAUDE.md
new file mode 100644
index 000000000..7f7891533
--- /dev/null
+++ b/test/CLAUDE.md
@@ -0,0 +1,39 @@
+# Test Infrastructure Guidance
+
+Lessons distilled from setting up the G5d soak harness (commit `abedc3f0` — 
see `docs/soak/g5d-runbook.md` and `scripts/soak-vectorized.sh` for a worked 
example). Apply these to any new harness in this repo: integration suites under 
`test/`, chaos scenarios, perf rigs, soak runs.
+
+## Diagnostic visibility — design the harness so failure leaves evidence
+
+- **Tee the orchestrator log to disk from the very first command.** Put `exec 
> >(tee -a "${DIST}/run.log") 2>&1` immediately after the run dir is created. 
Without this, `set -e` exits leave only the tail of stdout and you debug blind.
+- **Inspect every gRPC response, don't drain it.** When using streaming RPCs 
(e.g. `MeasureService.Write`), check `resp.GetStatus()` against 
`modelv1.Status_STATUS_SUCCEED.String()`. Non-success statuses are not errors 
at the transport layer — they're per-record rejections that disappear if you 
ignore the response.
+- **Default to fail-fast in setup, fail-tolerant in the inner loop.** Setup 
phases (compose up, schema create, fixture seed, baseline record) should `set 
-e` so a 48 h run aborts in 5 minutes when misconfigured. Inner periodic loops 
(pprof grab, parity diff, log scrape) should `|| log WARN` so a single bad 
sample doesn't kill a long-running observation window.
+- **`set -e` + bash arithmetic + multi-line `$()` capture is a silent-failure 
pattern.** `x=$(grep -c PAT file || echo 0)` returns `"0\n0"` when grep finds 
nothing, which `(( x == 0 ))` parses as a syntax error and silently treats as 
false. For anything beyond pure integer comparison, hand off to `python3 -c` or 
set `pipefail` and avoid the `|| echo` fallback.
+
+## Container hygiene on a single-host (no swap, no orchestrator) test box
+
+- **Set explicit `deploy.resources.limits.memory` and `cpus` on every 
service.** A 31 GB no-swap host OOM-kills immediately at the limit; defaults 
will eventually exceed it. Total compose footprint must leave at least 8 GB 
host headroom for the OS and IDE.
+- **Java + SkyWalking agent + Spring Boot demo needs ≥ 1 GB even with 
`-Xmx384m`.** The non-heap JVM overhead (metaspace, code cache, native, 
classloading) routinely exceeds heap. 512 MB caps OOM-kill at startup before 
`-Xmx` is meaningful.
+- **Containers default to root with mode 0700.** If the host shell needs to 
read or copy bind-mounted contents (snapshot/restore between phases), set 
`user: "${SOAK_UID:-1000}:${SOAK_GID:-1000}"` and export both from the 
orchestrator. Without this, `cp -a` from the host silently copies empty subdirs.
+- **`expose:` is for inter-container only; clients on the host need 
`ports:`.** A `connection refused` on `localhost:<port>` from a host-side test 
driver almost always means the port is exposed but not published.
+- **`docker compose down -v` is the correct teardown.** Plain `down` leaves 
volumes; subsequent `up` reuses them and you debug stale state.
+
+## BanyanDB-specific timing and config semantics
+
+These are the things you will assume wrong on first contact. Pin them in any 
harness that talks to BanyanDB:
+
+- **Flush is asynchronous and ~5 s by default.** Both the schema-property 
server (`--schema-server-flush-timeout`) and measure data 
(`--measure-flush-timeout`) buffer writes for up to 5 s before persisting. 
Snapshot-immediately-after-write produces silently-empty snapshots. Either pass 
smaller flush timeouts (`--measure-flush-timeout=500ms` etc.) or have the test 
driver poll-until-queryable instead of sleeping for a fixed delay.
+- **Query `Limit` defaults to 100.** A query against 1000 written rows returns 
100 silently. Always set `Limit` explicitly when the test cares about totals; a 
parity check that "passes" by comparing 100 vs 100 truncations is meaningless.
+- **Write timestamps must be millisecond-aligned.** The validator rejects 
sub-millisecond precision with `STATUS_INVALID_TIMESTAMP`. `time.Now()` has 
nanosecond precision and will be rejected — 
`time.Now().Truncate(time.Millisecond)` first.
+- **Default root paths point at `/tmp`, not at any subdirectory.** 
`--measure-root-path` (and `--stream-root-path`, `--property-root-path`, 
`--trace-root-path`, `--schema-server-root-path`) all default to `/tmp`. The 
quick-start docker-compose mounts `/tmp/banyanDB` and ignores it; the binary 
still writes to `/tmp/measure/...`. For any harness that needs persistence, 
override every root path to your bind-mounted dir.
+- **`FieldSpec` requires both `EncodingMethod` and `CompressionMethod`.** 
Creating a measure programmatically without setting these returns `compression 
method is unspecified`. Use `EncodingMethod_ENCODING_METHOD_GORILLA` + 
`CompressionMethod_COMPRESSION_METHOD_ZSTD` as a safe default for numeric 
fields.
+
+## Fixture strategy
+
+- **Self-seed deterministic data in tests; do not piggyback on OAP-managed 
measures.** OAP renames measures with downsampling suffixes (e.g. 
`service_traffic` becomes `service_traffic_minute` / `_hour` / `_day`) and 
groups them by bucket (`sw_metricsMinute` etc.). Targeting these is fragile 
across OAP versions and depends on traffic timing. A self-seeded `Group + 
Measure + N rows` is ~80 LOC of harness code, deterministic, and decouples the 
test from OAP's schema-install behavior.
+- **Time-bound parity-check queries to a known data window.** If the harness 
keeps writing while the parity loop runs (e.g. background OAP traffic during a 
soak), bound the diff queries to `[T0, T_snapshot]` so subsequent writes don't 
pollute the comparison. The data inside the window stays identical; everything 
outside is invisible to the diff.
+
+## Iteration discipline
+
+- **Add instrumentation before re-running, not after.** A 25 min smoke gives 
you one observation per run. The cheapest path is to add the next probe (status 
logging, fixture verification, intermediate dump) on each iteration, then 
re-run. Shotgun fixes that change three things at once make root-causing the 
next failure impossible.
+- **Budget multiple iterations on first contact.** A multi-component harness 
(compose + Go driver + bash orchestrator + binary protocol) routinely needs ~10 
iterations before passing end-to-end. Document each fix inline (`# Reason: 
<why>`) so the next person doesn't repeat the discovery.
+- **Smoke time should be << real run time.** A SMOKE preset of ~25 min for a 
48 h soak (~30× compression) is the right ratio. If smoke is too fast, it skips 
the timing-sensitive bugs (flush, schema persistence). If it's too slow, you 
can't iterate.

(skywalking-banyandb) 20/50: docs(test): add CLAUDE.md with G5d soak-harness lessons

Reply via email to