mrproliu opened a new pull request, #1193:
URL: https://github.com/apache/skywalking-banyandb/pull/1193

   ## Why
   
     Daily lifecycle tier-migration (hot→warm→cold) was OOM-killing the receiving warm/cold data node during external series-index introduction (exit 137), which cascaded into `chunk N: EOF` / `connection reset` on the sender and `no nodes available` for subsequent groups. The root cause: bluge's external-segment deduplicator decompressed and retained every segment's stored fields (~1.8 GB heap per segment), and because each gRPC sync stream runs its own goroutine with no shared memory budget, N concurrent senders stacked these materializations and blew past the cgroup limit.
   
     ## What
   
     This PR removes the OOM at the source and adds always-on throttling so an 
unexpected memory spike degrades gracefully instead of killing the node. The 
bluge bump is the root-cause fix; the throttling
     is defense-in-depth that backstops it.
   
     ### Root-cause fix
   
     Bump bluge to a memory-efficient external-segment deduplicator that 
resolves duplicates via `_id` term-dictionary point lookups 
(`Dictionary.Contains`) instead of decompressing and retaining stored
     fields. This drops per-segment heap from ~1.8 GB to near-zero during 
external-segment introduction, with no public API change and identical dedup 
semantics (verified by an upstream differential test).
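
As a rough illustration of why point lookups are cheaper than retaining stored fields, the sketch below deduplicates incoming IDs against existing segments using only a membership test. It is not bluge's implementation: `termDict`, `newTermDict`, and `filterNew` are hypothetical names, a sorted slice stands in for the real term dictionary, and which copy wins on a duplicate is not modelled.

```go
package main

import "sort"

// termDict is a hypothetical stand-in for one segment's `_id` term dictionary.
// It keeps only the sorted terms, never the documents' stored fields.
type termDict struct{ terms []string }

// newTermDict copies and sorts the IDs so Contains can binary-search them.
func newTermDict(ids ...string) *termDict {
	sorted := append([]string(nil), ids...)
	sort.Strings(sorted)
	return &termDict{terms: sorted}
}

// Contains is a point lookup: a binary search that loads no document data.
func (d *termDict) Contains(id string) bool {
	i := sort.SearchStrings(d.terms, id)
	return i < len(d.terms) && d.terms[i] == id
}

// filterNew returns the incoming IDs that none of the existing dictionaries contain.
func filterNew(existing []*termDict, incoming []string) []string {
	var fresh []string
	for _, id := range incoming {
		dup := false
		for _, d := range existing {
			if d.Contains(id) {
				dup = true
				break
			}
		}
		if !dup {
			fresh = append(fresh, id)
		}
	}
	return fresh
}
```

Memory in this model scales with the term dictionaries alone, which is the property the bump relies on.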
   
     ### Throttling (always on)
   
     The receiver and sender cooperate so the cases that can balloon receiver 
memory are bounded:
   
     - **Per-chunk load-shed** — a receiver under high memory pressure 
(protector `StateHigh`) answers `SYNC_STATUS_SERVER_BUSY` for incoming 
part/series-index chunks; the sender stops streaming, backs
     off, and retries the whole part.
     - **Introduce back-pressure** — before the heavy `CompleteSegment` 
introduce step, the receiver waits (on the gRPC stream context) for memory to 
recover, bounded by
     `{measure,stream,trace}-lifecycle-receive-mem-wait-timeout` (default 5m).
     - **Transient send retry** — the sender wraps pick+send in one bounded 
exponential-backoff loop (`pickAndRun`), retrying transient failures (target 
node restarting / `no nodes available`, disconnect,
     `SERVER_BUSY`) and re-picking a node on a failed send, bounded by 
`lifecycle-send-retry-timeout` (default 15m). This covers both the file-chunk 
path (all three catalogs) and the row-replay node-pick
     path. A part that exhausts its budget is recorded incomplete and resumed 
on the next migration cycle.
   
     ## Flags & wire
   
     Two new flags are added and documented (`docs/operation/configuration.md`, 
`docs/operation/lifecycle.md`): 
`{measure,stream,trace}-lifecycle-receive-mem-wait-timeout` (data node, default 
5m) and
     `lifecycle-send-retry-timeout` (lifecycle command, default 15m). The proto 
change is additive and backward-compatible: `SYNC_STATUS_SERVER_BUSY = 8` is 
appended to `cluster.v1.SyncStatus`, so peers
     that don't recognize it fall back to default handling.
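
To illustrate why appending an enum value is backward-compatible, the sketch below contrasts a peer that predates the new status with one that knows it. Only the value `8` comes from this description; `statusHypotheticalOK`, the action names, and the assumption that "default handling" means a generic failure are illustrative and not taken from the proto or the code.

```go
package main

// syncStatus mirrors the numeric wire representation of the enum.
type syncStatus int32

const (
	statusHypotheticalOK syncStatus = 1 // hypothetical value, not taken from the proto
	statusServerBusy     syncStatus = 8 // value stated in this PR
)

type senderAction string

const (
	actionContinue   senderAction = "continue"
	actionRetryLater senderAction = "back off and retry the part"
	actionFail       senderAction = "generic failure"
)

// legacyAction models a peer built before value 8 existed:
// the unknown number still decodes and lands in the default branch.
func legacyAction(s syncStatus) senderAction {
	switch s {
	case statusHypotheticalOK:
		return actionContinue
	default:
		return actionFail
	}
}

// currentAction models a peer that recognizes the busy status.
func currentAction(s syncStatus) senderAction {
	switch s {
	case statusHypotheticalOK:
		return actionContinue
	case statusServerBusy:
		return actionRetryLater
	default:
		return actionFail
	}
}
```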
   
     ## Testing
   
     A new integration test (`banyand/measure/migration_oom_throttling_test.go`) exercises the real measure tier-migration receive path with a test-controlled protector. It writes a real source tree and brings up a real target `measure` receiver on a real chunked-sync gRPC server with an injected protector (a fake that embeds `protector.NewMemory` and overrides only `State()`). With the protector HIGH, it streams the series-index part and asserts that the sender sees `queue.ErrServerBusy` (the `SERVER_BUSY` round-trip) and that the segment is not introduced. With the protector LOW, it asserts (`Eventually`) that the migration completes and that the series index lands and is queryable. The injection is test-only via an `export_test.go` helper; no production code changes for the test.
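
The "embed the real protector, override only `State()`" pattern relies on Go's method promotion. The sketch below shows the shape of such a fake with a hypothetical, trimmed-down interface; `memoryProtector`, `realProtector`, `AvailableBytes`, and `SetHigh` are illustrative names rather than BanyanDB's actual types.

```go
package main

import "sync/atomic"

type state int

const (
	stateLow state = iota
	stateHigh
)

// memoryProtector is a hypothetical, trimmed-down protector interface.
type memoryProtector interface {
	State() state
	AvailableBytes() int64
}

// realProtector stands in for the production implementation.
type realProtector struct{ limit int64 }

func (p *realProtector) State() state          { return stateLow }
func (p *realProtector) AvailableBytes() int64 { return p.limit }

// fakeProtector embeds a real protector and overrides only State,
// so every other method is promoted from the embedded value unchanged.
type fakeProtector struct {
	memoryProtector
	high atomic.Bool
}

// State reports the pressure level chosen by the test.
func (f *fakeProtector) State() state {
	if f.high.Load() {
		return stateHigh
	}
	return stateLow
}

// SetHigh lets the test flip memory pressure while the receiver is running.
func (f *fakeProtector) SetHigh(v bool) { f.high.Store(v) }
```

Using an atomic flag keeps the fake safe to toggle from the test goroutine while receiver goroutines read it.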
   
     Unit tests additionally cover the memory wait (recover / timeout / parent-cancel), the `queue.ChunkedSyncPartContext` context fallback, the `sub` BUSY translation, the `pub` BUSY recognition, and the lifecycle retry classifiers + `pickWithRetry`. Existing migration e2e/copy suites pass unchanged, and `make build`, `go vet`, and the full-repo `make lint` (golangci-lint v1.64.8) are clean. This PR also fixes a pre-existing file-descriptor leak on trace's `generateAll*PartData` error paths (already-opened handles are released before returning on a mid-way failure).
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Fixes apache/skywalking#<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   

