mrproliu opened a new pull request, #1168:
URL: https://github.com/apache/skywalking-banyandb/pull/1168

   ## Background
   
   The lifecycle migration moves data down the storage tiers (hot → warm → 
cold). For
   multi-target groups it uses **row-replay**: it reads each source part and 
re-publishes
   every row through the normal Write API so the target tier re-shards and 
re-indexes it.
   
   This PR fixes two production problems found while migrating real SkyWalking 
metrics data
   (`sw_metricsHour` / `sw_metricsMinute`): the data nodes were **OOM-killed** 
during
   row-replay, and parts with unrecoverable rows caused **hard migration 
failures**.
   
   ## Problem 1 — OOM during row-replay
   
   Migrating large measure parts drove a data node's heap to **~1.5 GB** and 
the pod was
   OOM-killed, stalling the whole migration. Three things compounded:
   
   1. The dump reader materialized a part's column data far beyond a single 
block — the
      decoder's internal buffer grew to hold the whole part's string data.
   2. Every replayed row allocated a **fresh marshal buffer**, so a part with 
millions of
      rows produced millions of short-lived large allocations.
   3. There was **no bound on in-flight batches** — rows were marshaled and 
queued faster
      than they were confirmed, so sent-but-unconfirmed payloads piled up in 
memory.
   
   ### How it's fixed
   
   The row-replay path is rebuilt around **bounded, reusable memory**:
   
   1. **Streaming columnar iterator** — the part is walked one block at a time
      (`IterateColumnar`) using reusable scratch buffers. The decoder is reset 
per block, so
      the decode buffer is bounded to a single block instead of the whole part.
   2. **Size-classed marshal buffer pool** (`row_replay_bufpool.go`) — each row 
borrows one
      reusable buffer grouped by power-of-two size class; the whole batch's 
buffers are
      returned together once `Publish` has put every body on the wire. Grouping 
by size class
      keeps a small body from pinning a large buffer, and lazy reclamation 
drops classes left
      idle for several flushes so a transient burst of large rows is released 
rather than
      hoarded.
   3. **Bounded batch + bounded confirmation pipeline** — a batch flushes when 
it reaches
      either `rowReplayMaxBatchRows` rows or **`rowReplayMaxBatchBytes` (32 
MiB)** of marshaled
      bytes, and a confirmation pipeline caps how many batches may be 
sent-but-unconfirmed
      (`rowReplayMaxInflight`). Together these put a hard ceiling on 
outstanding buffers
      regardless of part size.
   
   **Result:** peak heap drops from **~1.5 GB to ~200–300 MB**, independent of 
part size.
   
   ## Problem 2 — hard failure on unresolvable rows
   
   Some source parts can't have their rows republished because the entity 
identity can't be
   recovered:
   
   - **`rebuild-failed`** — the series-index (sidx) has a gap (no 
`EntityValues` for a
     `seriesID`); the part's entity columns are present but no candidate's 
recomputed hash
     matches, so the entity can't be rebuilt.
   - **`incomplete-part`** — the part carries no entity tag columns at all, so 
neither sidx
     nor column rebuild can recover the entity.
   
   Previously this **hard-failed** the part/segment migration.
   
   ### How it's handled
   
   These rows are now **skipped and retained** instead of failing the migration:
   
   - The unresolved rows are skipped; the rest of the part migrates normally 
and the part is
     marked complete.
   - A structured error is recorded with a bounded sample of the skipped series 
(part,
     `seriesID`, reason) so an operator can locate the exact source part — not 
just a count.
   - The **source segment is retained** (excluded from post-migration 
deletion), so skipped
     rows are never lost — they stay on the source tier and can still be 
queried.
   
   ## Error reporting consolidation (related)
   
   All migration errors are consolidated into a single flat list of 
**structured, stage-aware
   entries** (`source_stage` / `target_stage` / `group` / `catalog` / `scope` / 
`segment` /
   `shard` / `part` / `interval`), replacing the previous nested error maps. 
The report still
   exposes them under named buckets, and the `summary.*.errors` counts derive 
from the same
   single source.
   
   ## Validation (real cluster, full hot→warm→cold migration)
   
   - **Migration: 288/288 parts, 100% complete, 0 hard failures, 0 node-level 
send errors.**
     ~40M rows replayed.
   - **Memory: 0 OOM-kills, 0 restarts** across all data/lifecycle containers 
during the
     migration window. The active row-replay process peaked at **~205 MiB** (vs 
~1.5 GB
     before).
   - **Skip-and-retain: ~1.1M rows** skipped as `series-index gap / 
incomplete-part`, each
     recorded as a structured error with the source segment retained.
   - **Data is queryable** after migration via the gRPC `MeasureService/Query`: 
old time
     ranges return **0 on `hot`** (correctly migrated out) and **full results 
on `warm` and
     `cold`**.
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Fixes apache/skywalking#<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to