mrproliu opened a new pull request, #1193:
URL: https://github.com/apache/skywalking-banyandb/pull/1193
## Why
Daily lifecycle tier-migration (hot→warm→cold) was OOM-killing the
receiving warm/cold data node during external series-index introduction (exit
137), which cascaded into `chunk N: EOF` / `connection
reset` on the sender and `no nodes available` for subsequent groups. The
root cause: bluge's external-segment deduplicator decompressed and retained
every segment's stored fields (~1.8 GB heap per
segment), and because each gRPC sync stream runs its own goroutine with no
shared memory budget, N concurrent senders stacked these materializations and
blew past the cgroup limit.
## What
This PR removes the OOM at the source and adds always-on throttling so an
unexpected memory spike degrades gracefully instead of killing the node. The
bluge bump is the root-cause fix; the throttling
is defense-in-depth that backstops it.
### Root-cause fix
Bump bluge to a memory-efficient external-segment deduplicator that
resolves duplicates via `_id` term-dictionary point lookups
(`Dictionary.Contains`) instead of decompressing and retaining stored
fields. This drops per-segment heap from ~1.8 GB to near-zero during
external-segment introduction, with no public API change and identical dedup
semantics (verified by an upstream differential test).
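
For reviewers unfamiliar with the approach, here is a minimal sketch of dedup by point lookup. It is not bluge's code: `termDict`, `setDict`, and `newIDs` are hypothetical names, and which side of a duplicate gets dropped is an assumption made for illustration. It only shows that a `Contains`-style lookup needs no stored fields in memory.

```go
package main

import "fmt"

// termDict is a hypothetical stand-in for a segment's `_id` term dictionary.
// Deduplication only needs a point lookup, never the stored fields.
type termDict interface {
	Contains(key []byte) (bool, error)
}

// setDict is a toy in-memory dictionary used only for this illustration.
type setDict map[string]struct{}

func (d setDict) Contains(key []byte) (bool, error) {
	_, ok := d[string(key)]
	return ok, nil
}

// newIDs returns the candidate IDs that none of the existing dictionaries hold.
// Extra memory is bounded by the output slice, not by segment size.
func newIDs(existing []termDict, candidates [][]byte) ([][]byte, error) {
	var out [][]byte
	for _, id := range candidates {
		dup := false
		for _, d := range existing {
			ok, err := d.Contains(id)
			if err != nil {
				return nil, err
			}
			if ok {
				dup = true
				break
			}
		}
		if !dup {
			out = append(out, id)
		}
	}
	return out, nil
}

func main() {
	existing := []termDict{setDict{"a": {}, "b": {}}, setDict{"c": {}}}
	fresh, err := newIDs(existing, [][]byte{[]byte("b"), []byte("d"), []byte("c"), []byte("e")})
	if err != nil {
		panic(err)
	}
	for _, id := range fresh {
		fmt.Println(string(id)) // only IDs absent from every existing dictionary
	}
}
```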
### Throttling (always on)
The receiver and sender cooperate so the cases that can balloon receiver
memory are bounded:
- **Per-chunk load-shed** — a receiver under high memory pressure
(protector `StateHigh`) answers `SYNC_STATUS_SERVER_BUSY` for incoming
part/series-index chunks; the sender stops streaming, backs
off, and retries the whole part.
- **Introduce back-pressure** — before the heavy `CompleteSegment`
introduce step, the receiver waits (on the gRPC stream context) for memory to
recover, bounded by
`{measure,stream,trace}-lifecycle-receive-mem-wait-timeout` (default 5m).
- **Transient send retry** — the sender wraps pick+send in one bounded
exponential-backoff loop (`pickAndRun`), retrying transient failures (target
node restarting / `no nodes available`, disconnect,
`SERVER_BUSY`) and re-picking a node on a failed send, bounded by
`lifecycle-send-retry-timeout` (default 15m). This covers both the file-chunk
path (all three catalogs) and the row-replay node-pick
path. A part that exhausts its budget is recorded incomplete and resumed
on the next migration cycle.
## Flags & wire
Two new flags are added and documented (`docs/operation/configuration.md`,
`docs/operation/lifecycle.md`): the per-catalog
`{measure,stream,trace}-lifecycle-receive-mem-wait-timeout` (data node, default
5m) and
`lifecycle-send-retry-timeout` (lifecycle command, default 15m). The proto
change is additive and backward-compatible: `SYNC_STATUS_SERVER_BUSY = 8` is
appended to `cluster.v1.SyncStatus`, so peers
that don't recognize it fall back to default handling.
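
A rough sketch of why appending an enum value is compatible: proto3 enums are open, so an older peer still decodes the number 8 and routes it through whatever its `default` branch does. The handler names and every enum value other than 8 are placeholders, not the real `cluster.v1.SyncStatus` numbering or the real handlers.

```go
package main

import "fmt"

// syncStatus imitates a proto3 enum. Only the value 8 comes from this PR;
// statusOK is a placeholder value for illustration.
type syncStatus int32

const (
	statusOK         syncStatus = 1 // placeholder, not the real numbering
	statusServerBusy syncStatus = 8 // SYNC_STATUS_SERVER_BUSY = 8
)

// handleOld models a peer built before the new value existed:
// the unknown number falls into its default branch.
func handleOld(s syncStatus) string {
	switch s {
	case statusOK:
		return "ok"
	default:
		return "failed"
	}
}

// handleNew models an upgraded peer that recognizes BUSY and retries.
func handleNew(s syncStatus) string {
	switch s {
	case statusOK:
		return "ok"
	case statusServerBusy:
		return "retry"
	default:
		return "failed"
	}
}

func main() {
	fmt.Println(handleOld(statusServerBusy), handleNew(statusServerBusy))
}
```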
## Testing
A new integration test
(`banyand/measure/migration_oom_throttling_test.go`) exercises the real measure
tier-migration receive path with a test-controlled protector. It writes a real
source tree and brings up a real target `measure` receiver on a real
chunked-sync gRPC server with an injected protector (a fake that embeds
`protector.NewMemory` and overrides only `State()`). With the protector HIGH,
it streams the series-index part and asserts that the sender sees
`queue.ErrServerBusy` (the `SERVER_BUSY` round-trip) and that the segment is
not introduced. With the protector LOW, it asserts (`Eventually`) that the
migration completes and the series index lands and is queryable. The
injection is test-only via an `export_test.go` helper; no production code
changes for the test.
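
The embed-and-override pattern used by the fake protector looks roughly like this. All types here are hypothetical stand-ins (the real protector interface is larger); the point is that embedding keeps the real behaviour while `State()` alone is test-controlled.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type state int

const (
	stateLow state = iota
	stateHigh
)

// memory is a hypothetical, trimmed-down protector interface.
type memory interface {
	State() state
	Name() string
}

// realMemory stands in for the production protector.
type realMemory struct{}

func (realMemory) State() state { return stateLow }
func (realMemory) Name() string { return "memory-protector" }

// fakeMemory embeds the real protector and overrides only State(),
// so every other method keeps its production behaviour.
type fakeMemory struct {
	memory
	high atomic.Bool
}

func (f *fakeMemory) State() state {
	if f.high.Load() {
		return stateHigh
	}
	return stateLow
}

// receiveChunk is a toy receiver that sheds load under high memory pressure.
func receiveChunk(m memory) string {
	if m.State() == stateHigh {
		return "busy"
	}
	return "accepted"
}

func main() {
	f := &fakeMemory{memory: realMemory{}}
	f.high.Store(true)
	fmt.Println(receiveChunk(f)) // shed while the test holds the protector HIGH
	f.high.Store(false)
	fmt.Println(receiveChunk(f)) // accepted once the test flips it to LOW
}
```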
Unit tests additionally cover the memory wait (recover / timeout /
parent-cancel), the `queue.ChunkedSyncPartContext` context fallback, the `sub`
BUSY translation, the `pub` BUSY recognition, and the
lifecycle retry classifiers + `pickWithRetry`. Existing migration e2e/copy
suites pass unchanged, and `make build`, `go vet`, and the full-repo `make
lint` (golangci-lint v1.64.8) are clean. This PR
also fixes a pre-existing file-descriptor leak on trace's
`generateAll*PartData` error paths (already-opened handles are released before
returning on a mid-way failure).
- [x] Update the [`CHANGES`
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).