mrproliu opened a new pull request, #1154:
URL: https://github.com/apache/skywalking-banyandb/pull/1154
## Why
Tiered-storage lifecycle migration falls back to **row-replay**
(re-publishing each row through the Write API) for any source part whose
segment spans more than one target segment. This happens when the stage
intervals are non-multiple or coprime, e.g. `sw_metricsHour` hot **5d** →
warm **7d**, where a `[05-27, 06-01)` segment straddles the warm `05-28`
boundary.
For large parts this failed with:
```
code = DeadlineExceeded desc = context deadline exceeded
... confirm row-replay measure part .../seg-20260527/shard-0/...: 1 node error(s)
file-based measure migration failed ... measure parts incomplete
```
**Root cause:** row-replay sent the *entire part* on a single
client-streaming batch publisher. That stream's context deadline (the `30s`
batch timeout) was set once at stream-open and had to cover the whole part's
send. A real part of **425,940 rows** takes ~42s to build/route/marshal/publish,
so the single 30s window expired mid-part. The receiver then returned
`DeadlineExceeded`, the part was marked incomplete, and the group migration
ended as "partially completed", retrying with the same failure every cycle.
## Solution
Reworked row-replay so the timeout is **per batch**, not per part, and
extracted the machinery into one shared layer:
- **Per-batch timeout window:** each 2000-row batch is published on its
*own* publisher with a fresh timeout, with the next publisher rotated in
immediately. No single deadline ever spans the whole part, so arbitrarily
large parts complete.
- **Bounded confirm pipeline (depth 2):** a batch's receiver-side
confirmation overlaps with building/sending the next batch (double-buffering),
cutting wall time while capping in-flight memory.
- **Shared `batchSender` + `confirmPipeline`:** the per-type replayers
(measure/stream/trace) now just build messages and enqueue; the publisher
lifecycle (open/rotate/close), bounded overlap, and confirmation ordering live
in one place (removes ~350 lines of triplicated logic).
- **Standard `error` instead of a custom outcome struct:** replayers
return a plain `error` (`*nodeReplayError`) that still distinguishes per-node
delivery failures (`cee`) from global send/build failures.
- **Per-part timing instrumentation:** each part logs `build_send` vs
`confirm_wait` to pinpoint sender- vs receiver-bound migrations.
On abort, in-flight batches are still drained and the un-flushed residual
is discarded, so a failed part is re-replayed whole on resume (no
duplicate/partial commits).
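The per-batch window, the depth-2 bound, and the drain-on-abort behavior can be sketched together as follows. This is a minimal illustration under assumptions, not the actual `batchSender`/`confirmPipeline`: `publishFunc`, `inflight`, `replay`, and `demo` are hypothetical names, rows are plain `int`s, and the publisher is an in-memory fake.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// publishFunc is a hypothetical stand-in for a batch publisher. It sends one
// batch under ctx and returns a function that blocks until the receiver
// confirms (or rejects) that batch.
type publishFunc func(ctx context.Context, batch []int) (confirm func() error, err error)

// inflight is one published batch that still awaits confirmation.
type inflight struct {
	confirm func() error
	cancel  context.CancelFunc
}

// replay publishes rows in batches of batchSize. Each batch gets its own
// timeout, and at most depth batches await confirmation at any time. On the
// first error it stops sending, drains what is in flight, and returns that
// error; rows not yet sent are dropped.
func replay(rows []int, batchSize, depth int, timeout time.Duration, publish publishFunc) error {
	var pending []inflight
	var firstErr error
	confirmOldest := func() {
		oldest := pending[0]
		pending = pending[1:]
		if err := oldest.confirm(); err != nil && firstErr == nil {
			firstErr = err
		}
		oldest.cancel() // release the per-batch timer
	}
	for start := 0; start < len(rows) && firstErr == nil; start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		if len(pending) == depth {
			confirmOldest()
			if firstErr != nil {
				break
			}
		}
		// Fresh deadline per batch: no single window spans the whole part.
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		confirm, err := publish(ctx, rows[start:end])
		if err != nil {
			cancel()
			firstErr = err
			break
		}
		pending = append(pending, inflight{confirm: confirm, cancel: cancel})
	}
	for len(pending) > 0 { // drain in-flight batches even after an abort
		confirmOldest()
	}
	return firstErr
}

// demo replays n rows through a fake publisher that fails on batch number
// failAt (0 means never). It reports rows sent and peak unconfirmed batches.
func demo(n, failAt int) (sent, peak int, err error) {
	unconfirmed, batchNo := 0, 0
	publish := func(ctx context.Context, batch []int) (func() error, error) {
		if _, ok := ctx.Deadline(); !ok {
			return nil, errors.New("batch published without a deadline")
		}
		batchNo++
		if batchNo == failAt {
			return nil, errors.New("send failed")
		}
		sent += len(batch)
		unconfirmed++
		if unconfirmed > peak {
			peak = unconfirmed
		}
		return func() error { unconfirmed--; return nil }, nil
	}
	err = replay(make([]int, n), 2000, 2, 30*time.Second, publish)
	return sent, peak, err
}

func main() {
	fmt.Println(demo(4500, 0)) // 4500 2 <nil>
}
```

With 4,500 rows the sketch sends batches of 2000, 2000, and 500 rows and never holds more than two unconfirmed batches. Calling `demo(10000, 2)` fails on the second batch, so only the first 2000 rows are sent and the rest are discarded, which mirrors the "re-replayed whole on resume" behavior.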
## Validation
Restored real production data and ran the migration end-to-end:
- `sw_metricsHour` hot→warm: **23 parts row-replayed successfully, 0
`DeadlineExceeded`**, incl. part `5820` (**425,940 rows, ~41.8s**) and part
`17658` (**484,798 rows, ~61.5s**) — both far beyond the old 30s limit.
- Timing showed `build_send ≈ 41.7s` vs `confirm_wait ≈ 0.1s`, i.e.
sender-bound (the pipeline keeps the receiver from being the bottleneck).
## Tests
Added unit tests covering the pipeline contract: in-flight depth bound,
per-batch publisher/timeout rotation, every-row-sent across batch boundaries +
tail, abort-drains-and-discards (build error / iterator error / node error),
partial-failure reporting, and the `error` type's message/unwrap behavior.
`make lint` and the package `-race` suite are green.
- [ ] If this pull request closes/resolves/fixes an existing issue, replace
the issue number. Fixes apache/skywalking#<issue number>.
- [ ] Update the [`CHANGES`
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).