Re: [PR] feat: post-trace pipeline design — tail-sampling + per-stage rules [skywalking-banyandb]

via GitHub Thu, 28 May 2026 00:59:33 -0700


Copilot commented on code in PR #1144:
URL: 
https://github.com/apache/skywalking-banyandb/pull/1144#discussion_r3316216950



##########
api/proto/banyandb/pipeline/v1/trace_pipeline.proto:
##########
@@ -0,0 +1,187 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto3";
+
+package banyandb.pipeline.v1;
+
+import "banyandb/common/v1/common.proto";
+import "google/protobuf/duration.proto";
+import "validate/validate.proto";
+
+option go_package = 
"github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1";
+option java_package = "org.apache.skywalking.banyandb.pipeline.v1";
+
+// StageEvent enumerates the lifecycle events of a stage at which the filter 
runs.
+enum StageEvent {
+  STAGE_EVENT_UNSPECIFIED = 0;
+  // In-line filter during an LSM merge within a stage.
+  STAGE_EVENT_COMPACTION = 1;
+  // Scheduled final filter once a segment has settled: its event-time window
+  // has closed and the watermark has advanced past it by a grace period.
+  // (BanyanDB has no hard "seal" event; settling is watermark-driven.)
+  STAGE_EVENT_FINALIZE = 2;
+  // Pre-migration rewrite when a segment leaves this stage for the next.
+  STAGE_EVENT_MIGRATION_OUT = 3;
+}
+
+// TracePipelineConfig is the root configuration for a storage-node trace 
pipeline.
+// It reuses existing catalog identifiers (group via metadata, stage names, 
schema
+// names) for targeting instead of declaring a parallel metadata model.
+message TracePipelineConfig {
+  // Identity and revision tracking; metadata.group is the Group this pipeline
+  // lives in and applies to, consistent with every other schema resource.
+  common.v1.Metadata metadata = 1;
+  // Active status of the pipeline.
+  bool enabled = 2;
+  // Per-stage retention rules: which lifecycle stages this pipeline acts on,
+  // with the keep-predicates and lifecycle events for each. Empty means the
+  // tail-sampling gate at finalization is the only filter; no per-stage drop
+  // is applied.
+  repeated StageRule stages = 3;
+  // Explicit schema names to target within the Group (exact match on 
Metadata.name).
+  repeated string schema_names = 4;
+  // RE2 regular expression matched against schema names. A schema is targeted 
if it
+  // is listed in schema_names OR matches this pattern. When both are empty, 
every
+  // schema in the Group is targeted.
+  string schema_name_regex = 5;
+  // Tail-sampling (gating) rules applied at finalization (§8.3) to decide
+  // whether a trace is retained at all once its segment has settled.
+  TailSampling tail_sampling = 6;
+  // Per-trace maturity window for the in-line merge filter (§8.1). A trace is
+  // eligible for dropping during an LSM compaction merge only once its latest
+  // span timestamp is older than `now - merge_grace`; younger traces pass
+  // through the merge unchanged. Bounds the expected intra-trace span arrival
+  // spread (typically seconds). Must be strictly positive if set; if unset, 
the
+  // engine uses a documented default (currently 30s).
+  google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+  // Per-segment settling window for the scheduled finalization pass (§8.3). A
+  // segment is treated as settled, and the authoritative final filter runs,
+  // once the event-time watermark exceeds `segment.End + finalize_grace`.
+  // Bounds segment-wide late arrival (typically minutes). Must be strictly
+  // positive if set; if unset, the engine uses a documented default
+  // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace,
+  // `finalize_grace` is per-segment.
+  google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+}
+
+// StageRule binds the pipeline to one lifecycle stage of the targeted Group
+// and declares per-stage retention predicates.
+//
+// Semantics (when the rule fires at one of its `apply_at` events):
+//   - If every predicate field is unset (no `min_duration`, no `keep_errors`,
+//     and an empty `keep_tag_rules`) the rule has no filtering effect — every
+//     trace at this stage is retained, and the tail-sampling gate at
+//     finalization remains the only filter affecting traces here.
+//   - Otherwise the *set* predicates are combined with OR: a trace is retained
+//     if any single set predicate matches it; a trace that fails all set
+//     predicates is dropped from this stage at the firing event.
+//
+// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise
+// OR-combined (any matcher match satisfies the keep_tag_rules predicate).
+message StageRule {
+  // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", 
"cold").
+  string stage = 1;
+  // Which of this stage's lifecycle events run the filter. Empty means all of

Review Comment:
   Several string identifier fields are currently unconstrained, allowing empty 
stage names or tag keys. Elsewhere in the repo, identifier strings are 
validated with `string.min_len = 1` (e.g., `common.v1.LifecycleStage.name`). 
Adding the same validations here prevents configs that can't be meaningfully 
applied.
   



##########
api/proto/banyandb/pipeline/v1/trace_pipeline.proto:
##########
@@ -0,0 +1,187 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto3";
+
+package banyandb.pipeline.v1;
+
+import "banyandb/common/v1/common.proto";
+import "google/protobuf/duration.proto";
+import "validate/validate.proto";
+
+option go_package = 
"github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1";
+option java_package = "org.apache.skywalking.banyandb.pipeline.v1";
+
+// StageEvent enumerates the lifecycle events of a stage at which the filter 
runs.
+enum StageEvent {
+  STAGE_EVENT_UNSPECIFIED = 0;
+  // In-line filter during an LSM merge within a stage.
+  STAGE_EVENT_COMPACTION = 1;
+  // Scheduled final filter once a segment has settled: its event-time window
+  // has closed and the watermark has advanced past it by a grace period.
+  // (BanyanDB has no hard "seal" event; settling is watermark-driven.)
+  STAGE_EVENT_FINALIZE = 2;
+  // Pre-migration rewrite when a segment leaves this stage for the next.
+  STAGE_EVENT_MIGRATION_OUT = 3;
+}
+
+// TracePipelineConfig is the root configuration for a storage-node trace 
pipeline.
+// It reuses existing catalog identifiers (group via metadata, stage names, 
schema
+// names) for targeting instead of declaring a parallel metadata model.
+message TracePipelineConfig {
+  // Identity and revision tracking; metadata.group is the Group this pipeline
+  // lives in and applies to, consistent with every other schema resource.
+  common.v1.Metadata metadata = 1;

Review Comment:
   `TracePipelineConfig.metadata` is the identity field for this resource, but 
it isn't marked as required like other BanyanDB schema resources. Without 
`message.required`, an empty config can be accepted by PGV even though the 
server will need `metadata` for name/group/revision handling.
   



##########
api/proto/banyandb/pipeline/v1/trace_pipeline.proto:
##########
@@ -0,0 +1,187 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto3";
+
+package banyandb.pipeline.v1;
+
+import "banyandb/common/v1/common.proto";
+import "google/protobuf/duration.proto";
+import "validate/validate.proto";
+
+option go_package = 
"github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1";
+option java_package = "org.apache.skywalking.banyandb.pipeline.v1";
+
+// StageEvent enumerates the lifecycle events of a stage at which the filter 
runs.
+enum StageEvent {
+  STAGE_EVENT_UNSPECIFIED = 0;
+  // In-line filter during an LSM merge within a stage.
+  STAGE_EVENT_COMPACTION = 1;
+  // Scheduled final filter once a segment has settled: its event-time window
+  // has closed and the watermark has advanced past it by a grace period.
+  // (BanyanDB has no hard "seal" event; settling is watermark-driven.)
+  STAGE_EVENT_FINALIZE = 2;
+  // Pre-migration rewrite when a segment leaves this stage for the next.
+  STAGE_EVENT_MIGRATION_OUT = 3;
+}
+
+// TracePipelineConfig is the root configuration for a storage-node trace 
pipeline.
+// It reuses existing catalog identifiers (group via metadata, stage names, 
schema
+// names) for targeting instead of declaring a parallel metadata model.
+message TracePipelineConfig {
+  // Identity and revision tracking; metadata.group is the Group this pipeline
+  // lives in and applies to, consistent with every other schema resource.
+  common.v1.Metadata metadata = 1;
+  // Active status of the pipeline.
+  bool enabled = 2;
+  // Per-stage retention rules: which lifecycle stages this pipeline acts on,
+  // with the keep-predicates and lifecycle events for each. Empty means the
+  // tail-sampling gate at finalization is the only filter; no per-stage drop
+  // is applied.
+  repeated StageRule stages = 3;
+  // Explicit schema names to target within the Group (exact match on 
Metadata.name).
+  repeated string schema_names = 4;
+  // RE2 regular expression matched against schema names. A schema is targeted 
if it
+  // is listed in schema_names OR matches this pattern. When both are empty, 
every
+  // schema in the Group is targeted.
+  string schema_name_regex = 5;
+  // Tail-sampling (gating) rules applied at finalization (§8.3) to decide
+  // whether a trace is retained at all once its segment has settled.
+  TailSampling tail_sampling = 6;
+  // Per-trace maturity window for the in-line merge filter (§8.1). A trace is
+  // eligible for dropping during an LSM compaction merge only once its latest
+  // span timestamp is older than `now - merge_grace`; younger traces pass
+  // through the merge unchanged. Bounds the expected intra-trace span arrival
+  // spread (typically seconds). Must be strictly positive if set; if unset, 
the
+  // engine uses a documented default (currently 30s).
+  google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+  // Per-segment settling window for the scheduled finalization pass (§8.3). A
+  // segment is treated as settled, and the authoritative final filter runs,
+  // once the event-time watermark exceeds `segment.End + finalize_grace`.
+  // Bounds segment-wide late arrival (typically minutes). Must be strictly
+  // positive if set; if unset, the engine uses a documented default
+  // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace,
+  // `finalize_grace` is per-segment.
+  google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+}
+
+// StageRule binds the pipeline to one lifecycle stage of the targeted Group
+// and declares per-stage retention predicates.
+//
+// Semantics (when the rule fires at one of its `apply_at` events):
+//   - If every predicate field is unset (no `min_duration`, no `keep_errors`,
+//     and an empty `keep_tag_rules`) the rule has no filtering effect — every
+//     trace at this stage is retained, and the tail-sampling gate at
+//     finalization remains the only filter affecting traces here.
+//   - Otherwise the *set* predicates are combined with OR: a trace is retained
+//     if any single set predicate matches it; a trace that fails all set
+//     predicates is dropped from this stage at the firing event.
+//
+// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise
+// OR-combined (any matcher match satisfies the keep_tag_rules predicate).
+message StageRule {
+  // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", 
"cold").
+  string stage = 1;
+  // Which of this stage's lifecycle events run the filter. Empty means all of
+  // them (compaction merges, segment finalization, and outbound migration).
+  repeated StageEvent apply_at = 2;
+  // Minimum total trace duration to retain at this stage. A trace whose
+  // duration is at least min_duration is retained. Strictly positive if set;
+  // if unset, duration is not a retention factor for this stage.
+  google.protobuf.Duration min_duration = 3 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+  // If true, traces with any error span are retained at this stage regardless
+  // of duration or tag predicates.
+  bool keep_errors = 4;
+  // Sure-keep tag matchers: a trace satisfies this predicate when ANY of the
+  // matchers matches any of its spans (matchers are OR-combined). Distinct
+  // from TailSampling.tag_rules (which is for ingest-time gating with a
+  // sample_rate); TagMatcher has no sample_rate — a match is an absolute keep.
+  repeated TagMatcher keep_tag_rules = 5;
+}
+
+// TailSampling manages tail-sampling (ingest-gating) decisions at 
finalization.
+message TailSampling {
+  // Absolute trace duration threshold. Traces whose total latency meets or
+  // exceeds this value are preserved (a sure-keep at gating). Required and
+  // strictly positive.
+  google.protobuf.Duration duration_threshold = 1 [(validate.rules).duration = 
{
+    required: true
+    gt: {seconds: 0}
+  }];
+  // Immediate retention rule when any span contains an error state.
+  bool keep_all_errors = 2;
+  // Probabilistic keep rate for healthy, low-latency traces that match no
+  // sure-keep rule. The decision is a deterministic hash(trace_id) < rate, so
+  // it is stable across re-evaluation. Must be in [0.0, 1.0].
+  double healthy_sample_rate = 3 [(validate.rules).double = {
+    gte: 0.0
+    lte: 1.0
+  }];
+  // Conditions based on tags for immediate sampling decisions.
+  repeated TagSamplingRule tag_rules = 4;
+}
+
+// TagSamplingRule keeps traces whose tag value satisfies a matcher, with an
+// optional probabilistic sample rate (used by the tail-sampling gate).
+message TagSamplingRule {
+  string tag_key = 1;
+  // Matching condition against the tag's value.
+  oneof matcher {
+    // Exact string equality.
+    string equals = 2;
+    // Value is one of a set.
+    StringList in = 3;
+    // RE2 regular expression match.
+    string regex = 4;
+    // Tag is present with any value.
+    bool exists = 5;
+  }
+  // Probabilistic keep rate applied when the matcher matches (deterministic
+  // hash(trace_id) < sample_rate); set to 1.0 to make a match an absolute 
sure-keep.
+  double sample_rate = 6 [(validate.rules).double = {
+    gte: 0.0
+    lte: 1.0
+  }];
+}
+
+// TagMatcher is a sure-keep tag predicate used by StageRule.keep_tag_rules.
+// Unlike TagSamplingRule, it carries no sample_rate — a match is an absolute 
keep.
+message TagMatcher {
+  string tag_key = 1;
+  // Matching condition against the tag's value.

Review Comment:
   `TagMatcher.tag_key` should be non-empty for the same reason as 
`TagSamplingRule.tag_key`; an empty key yields a config that can't match any 
tag.
   



##########
docs/design/backlog/post-trace-scoring.md:
##########
@@ -0,0 +1,113 @@
+# Backlog: Scoring-Based Post-Trace Retention
+
+> **Status: Deferred.** Removed from the main post-trace pipeline design 
([`../post-trace-pipeline.md`](../post-trace-pipeline.md)) in favor of 
per-stage hard retention predicates (`StageRule.min_duration`, `keep_errors`, 
`keep_tag_rules`). This document preserves the previously-proposed weighted 
scoring model — formula, persistence, revision pinning, and proto messages — 
for potential future revival if a workload demonstrates a genuine need for 
weighted multi-signal retention.
+
+## Why deferred
+
+Scoring's appeal is collapsing multiple value signals (duration, errors, tag 
matches) into one weighted number per trace, with each stage setting a single 
retention threshold. Its costs, weighed against the simpler per-stage 
hard-predicate model, were:
+
+- **Opacity.** "Why was this trace dropped?" requires evaluating the formula 
against pinned weights and thresholds.
+- **Translation pain for simple cases.** Expressing "Hot keeps duration > 50 
ms" as a score threshold requires `log10(50/D_threshold + 1)` — a terrible UX 
for what should be a direct cutoff.
+- **Magic-number tuning.** Operators must calibrate weights such that the 
threshold lands meaningfully across all trace categories.
+- **Debug friction.** Predicate-based retention is self-explanatory; scoring 
requires simulation.
+
+For the workloads currently targeted (SkyWalking native and Istio/Zipkin mesh 
traces, per the showcase), retention intent is well captured by per-stage hard 
predicates. Scoring's only genuine win — relative weighting of categories 
("error_weight is 4× duration_weight") — is rarely needed in practice. If a 
workload arrives where it *is* needed, this document is the starting point for 
revival.
+
+## Architectural role (former §1.2)
+
+In the scoring model, gating (tail-sampling) decides whether a trace block 
survives at all, and scoring then governs **where it lives and how long it is 
kept** within whatever stage holds it:
+
+- **Operation Type**: Continuous utility function (scales from $0.0$ to 
$\infty$).
+- **Responsibility**: Multi-tier storage routing, TTL scheduling, downstream 
analytical profiling.
+- **Compute Profile**: Moderate CPU overhead — weighted multi-attribute 
logarithmic curves and dynamic tag-indexing comparisons.
+
+## The scoring formula
+
+The base score, computed after the trace's segment has settled 
($t_{\text{final}}$), is:
+
+$$S_{\text{trace}}(t_{\text{final}}) = w_d \cdot \log_{10}\left( 
\frac{D_{\text{total}}}{D_{\text{threshold}}} + 1 \right) + w_e \cdot 
\mathbb{I}(\text{has\_error}) + \sum_{i} w_{\text{tag}, i} \cdot 
\mathbb{I}(\text{Tag}_i = V_i)$$
+
+Where:
+
+- $D_{\text{total}}$: The total duration (latency) of the trace — timestamp 
difference between the earliest span start and the latest span end in the 
grouped trace block, clamped to $\max(0, D_{\text{total}})$ so cross-host clock 
skew cannot yield a negative value or an undefined logarithm. With the clamp, 
$D_{\text{total}}/D_{\text{threshold}} + 1 \ge 1$, so the duration term is 
always $\ge 0$.
+- $D_{\text{threshold}}$: The base duration threshold (e.g. $500\text{ ms}$).
+- $w_d$: Duration weight (`duration_weight`).
+- $w_e$: Error weight (`error_weight`).
+- $\mathbb{I}(\text{has\_error})$: $1$ if at least one span has its error 
state set, else $0$.
+- $w_{\text{tag}, i}$: Weight of the $i$-th target tag (`tag_weights`).
+- $\mathbb{I}(\text{Tag}_i = V_i)$: $1$ if any span in the trace carries that 
key/value pair, else $0$.
+
+## Score ranges and boundaries
+
+- **Lower bound:** $S_{\text{trace}} = 0.0$, reached when the clamped duration 
is $0$, no span carries an error, and no tag matches. A zero- or 
negative-duration trace lands on this $0$ floor.
+- **Upper bound:** Theoretically $[0.0, \infty)$; in practice $[0.0, 30.0]$ 
under typical production weights given normal request-timeout ceilings (≤ 60 s).
+
+### Worked example (real `sw_trace` data)
+
+Using the original Scenario 7.1 weights — $D_{\text{threshold}} = 500\text{ 
ms}$, $w_d = 2.0$, $w_e = 12.0$:
+
+- **Slowest healthy trace** — `trace_id b03bb932-…-63a35cb317f7`, `/homepage` 
on `agent::ui` → `agent::frontend`, $D_{\text{total}} = 2802\text{ ms}$, no 
error span:
+
+$$S_{\text{trace}} = 2.0 \cdot \log_{10}\left( \frac{2802}{500} + 1 \right) + 
12.0 \cdot (0) = 2.0 \cdot \log_{10}(6.604) \approx 1.64$$
+
+- **Error trace** — `trace_id 5fcdb353-…-ec8cb4a1c88a`, `POST /test` on 
`agent::app`, $D_{\text{total}} = 4\text{ ms}$, `is_error = 1`:
+
+$$S_{\text{trace}} = 2.0 \cdot \log_{10}\left( \frac{4}{500} + 1 \right) + 
12.0 \cdot (1) \approx 12.01$$
+
+The error term dominates: even a $4\text{ ms}$ failed request outscores a 
genuinely slow ($2.8\text{ s}$) but healthy page load.
+
+## Stage-stepped scoring thresholds
+
+The model is **no continuous decay**: a trace's score is computed once at 
finalization and never decays. "Aging" is expressed structurally — each stage 
sets its own `retention_score_threshold`, and the bar rises as data moves to 
colder stages. A typical profile rises Hot → Warm → Cold (e.g. `1.0` → `3.0` → 
`10.0`): a high-value trace clears every bar and reaches Cold; a low-value 
trace is dropped at the first stage whose threshold it fails. One configured 
number per stage, no power-law constants, no per-trace recomputation. Time 
enters only through *when* a partition reaches each stage, governed by 
`LifecycleStage.ttl` / `segment_interval`.
+
+## Score persistence and pipeline revision
+
+"Computed once and frozen" is only safe if both the frozen value and the 
configuration that produced it are physically persisted alongside each trace.
+
+**Per-trace post-trace metadata stream.** The finalization pass writes one 
record per surviving trace into a per-trace metadata stream inside the part, 
parallel to the existing `spans`, `tags`, and `primary` streams. Each record is 
keyed by `trace_id` and is fully self-contained:
+
+- `score` (`float64`) — the frozen $S_{\text{trace}}(t_{\text{final}})$ value.
+- `pipeline_revision` (`int64`) — the value of 
`TracePipelineConfig.metadata.modRevision` of the pipeline that produced the 
score (audit/debug).
+- `stage_thresholds` (`map<string, float64>`) — the resolved per-stage 
`retention_score_threshold` map at that revision (e.g. `{hot: 1.0, warm: 3.0, 
cold: 10.0}`). Pinning the thresholds — not just the revision id — keeps each 
trace's retention destiny self-describing, with no dependency on retained 
pipeline-config history at comparison time.
+
+Engine-managed (not a user-declared tag), so no `Trace` schema change is 
required. Overhead is on the order of tens of bytes per trace.
+
+**Rewrites preserve all three values.** Compaction merges and the 
pre-migration rewrite copy per-trace columns by `trace_id` via the existing 
`blockReader` / `blockWriter` machinery; the post-trace metadata stream is 
carried alongside `spans` / `tags` / `primary`. Subsequent rewrites never 
recompute `score` from spans on a finalized trace.
+
+**Read paths.** Pre-migration is the primary consumer: by the time a segment 
migrates it is past its stage TTL ≫ `finalize_grace`, so every surviving trace 
already has a record. The compaction filter consults the record when present; 
for the brief window in which a trace is mature (per `merge_grace`) but its 
segment has not yet been finalized, no record exists yet — the filter computes 
`score` on the fly and uses the current pipeline's threshold (this matches what 
the later finalization will pin).
+
+**Config-edit semantics: forward-looking, never retroactive.** Editing 
`TracePipelineConfig` advances `metadata.modRevision`; traces finalized under 
an earlier revision keep their pinned `score`, `pipeline_revision`, and 
`stage_thresholds`. Retroactive re-evaluation requires an explicit operator 
action.
+
+## Proto schema (for revival)
+
+If revived, scoring would reintroduce these messages to 
`api/proto/banyandb/pipeline/v1/trace_pipeline.proto`:
+
+```Plain Text
+message ScoringRules {
+  double duration_weight = 1 [(validate.rules).double = {gte: 0.0}];
+  double error_weight    = 2 [(validate.rules).double = {gte: 0.0}];
+  repeated TagWeight tag_weights = 3;
+}
+
+message TagWeight {
+  string key   = 1;
+  string value = 2;
+  double weight = 3 [(validate.rules).double = {gte: 0.0}];
+}
+```
+
+And `StageRule` would carry a per-stage threshold (field 2 is reserved in the 
current proto exactly so this revival is wire-safe):
+
+```Plain Text
+double retention_score_threshold = 2 [(validate.rules).double = {gte: 0.0}];
+```
+
+Plus a `ScoringRules scoring_rules = 7;` field on `TracePipelineConfig` (field 
7 is reserved in the current proto for the same reason).
+

Review Comment:
   This section claims `StageRule` field 2 and `TracePipelineConfig` field 7 
are reserved for scoring revival, but in the current proto `StageRule` field 2 
is `apply_at` and `TracePipelineConfig` field 7 is `merge_grace`. As written, 
this doc would mislead future implementers about wire-compatibility constraints.
   



##########
docs/design/post-trace-pipeline.md:
##########
@@ -0,0 +1,493 @@
+# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping 
Specification
+
+This document presents the complete technical design for the **BanyanDB 
Storage-Node Post-Trace Pipeline** and its integration with physical hardware 
storage tiers and time-aging schedulers.
+
+Unlike traditional streaming-based telemetry engines that perform span 
assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a 
**native trace model**. Spans are stored, sorted, and indexed by `trace_id` 
directly within the storage engine layout on individual data nodes.
+
+By decoupling trace assembly from the active ingestion path, the post-trace 
pipeline executes asynchronous tail-sampling, per-stage retention filtering, 
and data reduction natively during storage lifecycle events on the data nodes. 
This design is grounded in recent academic advancements in post-hoc retroactive 
tracing and storage-level file merge analysis.
+
+## Architectural Concept: Decoupled Gating and Per-Stage Retention
+
+To maximize storage efficiency and compute performance on the BanyanDB data 
nodes, trace evaluation is separated into two logical phases, which 
subsequently feed into an automated **time-aging system** for partition-level 
migration:
+
+```Plain Text
++-----------------------------------+
+                  |      Grouped Trace Assembly       |
+                  +-----------------------------------+
+                                    |
+                                    v
++-----------------------------------------------------------------------+
+| 1. Tail-Sampling Gating (tail_sampling, at finalization)              |
+|    - Goal: Decide whether a trace is retained at all                  |
+|    - Criteria: sure-keep rules (duration / errors / tags) + sampling  |
++-----------------------------------------------------------------------+
+                                    |
+                  +-----------------+-----------------+
+                  | (Retain)                          | (Drop / Purge)
+                  v                                   v
++---------------------------------------------------+ +-----------------+
+| 2. Per-Stage Retention (StageRule predicates)     | |  Discard Block  |
+|    - Goal: At each stage's lifecycle event,       | |  Reclaim Space  |
+|      keep / drop per the stage's predicates       | +-----------------+
+|    - Criteria: min_duration / keep_errors /       |
+|                keep_tag_rules, per stage          |
++---------------------------------------------------+
+                  |
+                  v
++-----------------------------------------------------------------------+
+| 3. Time-Aging System (Stage-Stepped Migration Engine)                 |
+|    - Goal: Each stage's StageRule governs survival at its lifecycle   |
+|      events (compaction, finalization, migration_out)                 |
+|    - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order)      |
+|    - RULE: medium is a node-group concern (may be heterogeneous);     |
+|      predicates govern per-stage retention, not medium selection.     |
++-----------------------------------------------------------------------+
+```
+
+### 1.1 Gating (`tail_sampling`)
+
+- **Operation Type**: Tail-sampling filter (keep/drop decision at 
finalization). It combines absolute *sure-keep* rules with a hashed 
probabilistic sample of everything else; it is not a pure boolean classifier.
+
+- **Responsibility**: Initial filtering at the finalization pass (§8.3). It 
determines whether an assembled trace block **survives** at all — passing into 
the stage lifecycle — or is **purged** immediately to reclaim storage space on 
disk.
+
+- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules 
(duration boundaries, error presence, tag match) plus a single `hash(trace_id) 
< healthy_sample_rate` comparison. The probabilistic keep is a deterministic 
`hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / 
drop decision is stable only once the trace has stopped growing: a partial 
trace's duration is a lower bound, so any drop is deferred until the trace's 
last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured 
trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) 
then yields the same keep/drop result.
+
+### 1.2 Per-Stage Retention (StageRule predicates)
+
+- **Operation Type**: Per-stage keep/drop predicate evaluation. Each 
`StageRule` declares which trace properties (`min_duration`, `keep_errors`, 
`keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails 
every predicate is dropped at the stage's next configured lifecycle event.
+
+- **Responsibility**: Decides **which traces survive each stage** as data 
ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed 
by increasingly strict predicates at each stage's `MIGRATION_OUT` (or 
`COMPACTION`) event. The pipeline does **not** choose the physical storage 
medium; that is a node-group / `LifecycleStage` placement concern (§4.1).
+
+- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata 
(free, no decode); `has_error` requires a decoded read of span bodies only when 
`keep_errors` is set; tag matchers run against indexed tags. Predicates are 
direct boolean checks.
+
+## Protobuf Message Design
+
+The pipeline configuration is a single trace-typed message, 
`TracePipelineConfig`. Rather than a parallel abstract metadata layer, it 
reuses the existing catalog identifiers — Group (via `metadata`), lifecycle 
stage names, and schema names — and adds only the trace-specific tail-sampling 
and per-stage retention rules. General stream parsing and metrics aggregation 
configurations are excluded to maintain a strict focus on trace-centric 
analytical operations.
+
+### 2.1 Targeting Model: Reuse Existing Catalog Identifiers
+
+This design deliberately does **not** introduce an abstract `Pipeline` 
resource or an `ExecutionTrigger` enum. Both would re-declare targeting that 
the storage model already owns and let the two drift. A trace pipeline is 
instead expressed entirely with identifiers that already exist:
+
+- **Group** — named by the config's own `metadata.group` 
(`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that 
Group, exactly as every other schema resource does. The Group already fixes the 
`catalog`, so catalog is never repeated.
+- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage 
from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and 
carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, 
`keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary 
queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules 
are configured here, per stage and per pipeline — there are no hardcoded 
predicates in the engine.
+- **Schema selector** — an explicit `schema_names` list (exact match on 
`common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches 
if it is listed OR matches the regex; both empty targets every schema in the 
Group.
+
+The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced 
by **stage lifecycle events**, because each trigger was already anchored to a 
stage: compaction is an LSM merge within a stage (§8.1), migration is a segment 
leaving one stage for the next (§8.2), and the scheduled pass finalizes a 
settled time segment within a stage (§8.3). Each `StageRule` carries its own 
`apply_at` selecting which of those events run the filter; empty means all of 
them. This keeps the trace-specific rules out of the generic, catalog-agnostic 
`common.v1.LifecycleStage` (which stream and measure groups share) while still 
binding them to stages by name.
+
+Multiple `TracePipelineConfig`s may coexist in a Group only when their 
effective coverage is disjoint; the schema registry enforces this with a 
per-tuple uniqueness rule (§2.3).
+
+### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`)
+
+The full message definitions live in the proto source at 
`api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package 
`banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is 
the single root resource: it carries its own identity (`metadata`), the 
targeting fields from §2.1, and the trace-specific tail-sampling and per-stage 
retention rules. There is no embedded abstract `Pipeline` and no 
`ExecutionTrigger`; the lifecycle events that run the filter are named by 
`apply_at`.
+
+The message set is:
+
+- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the 
per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, 
the `tail_sampling` (gating) block, and the completeness windows `merge_grace` 
(§8.1) and `finalize_grace` (§8.3).
+- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the 
per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, 
`min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are 
OR-combined: a trace survives if any one matches; it is dropped if it fails all 
set predicates. A rule with no predicates set at all has no filtering effect 
(see §4.2).
+- **`StageEvent`** — enum of the lifecycle events a rule may run at: 
`STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and 
`STAGE_EVENT_MIGRATION_OUT` (§8.2).
+- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: 
`duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` 
(each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`).
+- **`TagMatcher`** — sure-keep tag predicate used by 
`StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries 
no `sample_rate` — a match is an absolute keep.
+
+### 2.3 Uniqueness and Conflict Policy
+
+Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most 
one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` 
tuple**. The schema registry enforces this on write so the filter behavior at 
any tuple is deterministic — there is no implicit ordering, priority, or 
composition of overlapping pipelines.
+
+**Effective coverage of a pipeline.** A `TracePipelineConfig` P with 
`enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when:
+
+- `schema` is a schema in `P.metadata.group` whose name is listed in 
`P.schema_names` or matches `P.schema_name_regex` (both empty selectors target 
every schema in the Group);
+- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every 
stage in the Group's `ResourceOpts`);
+- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers 
every `StageEvent`).
+
+**Conflict detection on write.** When a `TracePipelineConfig` is created or 
updated, the registry expands its effective coverage against the current 
schemas and stages of its Group and rejects the write if **any** covered tuple 
is already covered by another `enabled` pipeline in the same Group. The error 
names the conflicting pipeline and the overlapping tuple(s).
+
+**Schema additions.** When a new `Trace` schema is created in a Group, the 
registry re-validates every `enabled` `TracePipelineConfig` in that Group 
against the new schema. If the new schema would cause two pipelines to overlap 
on any `(stage, event)`, the schema creation is rejected; the operator must 
narrow the pipelines (e.g. add explicit `schema_names`) before adding the 
schema.
+
+**Disabling and replacement.** Setting `enabled=false` removes a pipeline from 
active coverage immediately, freeing its tuples for another pipeline to claim. 
This is the safe path for atomically swapping in a new pipeline: disable the 
old, enable or create the new, then delete the old. There is no implicit 
precedence between two `enabled` pipelines — the registry will not pick a 
winner.
+
+**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and 
migration (§8.2) the engine selects the active pipeline by `(group, schema, 
stage, event)`. If a registry inconsistency ever exposes more than one active 
match for a tuple, the engine logs an error and applies **none** of them — 
failing safe to retain rather than risk a nondeterministic destructive drop.
+
+**Recommended pattern.** For most workloads, a single pipeline per Group with 
a broad `schema_name_regex` is the simplest configuration. Multiple pipelines 
in a Group are useful only when their schema selectors are mutually exclusive 
(for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in 
a hypothetical mixed group) so no tuple is doubly covered.
+
+### 2.4 Admission Validation
+
+The proto pins bounds with PGV so a malformed config is rejected at parse time 
rather than producing undefined behavior. The schema registry's admission 
control adds one cross-field rule that PGV cannot express.
+
+**PGV-enforced bounds** (rejected at proto parse / `Write` time):
+
+- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 
0s`).
+- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in 
`[0.0, 1.0]`.
+- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset 
means duration is not a retention factor for that stage.
+- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — 
strictly positive when set; unset means engine default (§8.1 / §8.3).
+
+**Admission rule (server-side; not expressible in PGV):**
+
+- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = 
true`, the config must declare at least one of: a non-empty `tail_sampling`, or 
at least one `StageRule` with at least one retention predicate (`min_duration`, 
`keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with 
no gating and no per-stage predicates has no effect on traces and is rejected 
as a configuration error. To temporarily disable a pipeline, set `enabled = 
false` — do not strip its rule blocks.
+
+Both the registry write path and the runtime apply these checks. If a registry 
inconsistency ever lets a malformed config reach the engine, the engine logs an 
error and falls back to **retain** for every tuple the malformed config would 
have governed — never a destructive drop (consistent with the §2.3 runtime 
fail-safe).
+
+## Retention-Decision Timing & Trace Completeness
+
+Both the tail-sampling gate (§1.1) and the per-stage retention predicates 
(§1.2) operate on the **whole** trace, not on individual spans as they arrive: 
`D_total` needs the earliest start and latest end, `has_error` scans all spans, 
and tag matchers scan all spans. Spans of one trace, however, arrive at the 
storage node anywhere from milliseconds to hours apart, and BanyanDB writes 
each span into the segment selected by its **event time** (the `start_time` 
tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no 
live partial-trace buffer; a trace exists only as spans co-located by 
`trace_id` inside a segment's parts.
+
+### 3.1 When Retention Decisions Are Safe
+
+Predicates are not evaluated for destructive drops at write time. They run 
lazily, by the post-trace passes that re-assemble a trace by `trace_id`:
+
+- **Provisional** — during an LSM compaction merge (§8.1). This sees only the 
spans that have landed so far, so a drop verdict could be premature; the 
per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace 
has stopped growing.
+- **Authoritative** — at the post-trace scheduler's finalization pass once the 
trace's segment has **settled**: its event-time window has closed *and* the 
event-time watermark has advanced past it by the `finalize_grace` period 
(§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates 
the next segment up to an hour *before* the current window ends, and late data 
keeps routing to an old segment by event time — so "settled" is a watermark 
heuristic, not a guarantee.
+- **At migration_out** — by the time a segment is migrated to the next stage 
(≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every 
trace it holds has been settled, so predicate evaluation is final.
+
+Two completeness caveats follow from event-time segmentation:
+
+1. **Spans arriving after finalization.** A span whose `start_time` falls in 
an already-finalized segment is still accepted (while the segment is within 
retention) but is missed by the finalization-time evaluation. The 
`finalize_grace` period bounds how often this happens; a late span that lands 
after it is only re-evaluated if a subsequent compaction rewrites that part.
+
+2. **Traces spanning multiple segments.** A trace longer than 
`segment_interval`, or one that straddles a boundary, is physically split 
across two segments, and each segment's pass evaluates only its own fragment 
(`D_total` and the indicators are fragment-local). BanyanDB performs no 
cross-segment trace assembly at the storage layer. Where trace duration is far 
below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day 
segments) the fragment equals the whole trace and this is a non-issue; for 
long-running traces it is a known limitation.
+
+## Integration of the Retention System & Time-Aging System
+
+Distributed telemetry's diagnostic utility naturally decays as time passes. 
The pipeline must therefore decide, per stage, what to keep — but it must 
**not** try to pick the physical storage medium, because in BanyanDB the medium 
is already determined by where a `LifecycleStage` is placed.
+
+### 4.1 Stage Placement vs. Rule-Governed Retention
+
+The physical medium is **not** fixed per tier by this design. Each 
`LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through 
its `node_selector`; the medium is whatever hardware that node group runs, and 
operators may deploy heterogeneous backings (for example two warm node groups 
on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA 
SSD/HDD, Cold → object storage, but that mapping is an operator deployment 
choice, not a rule this pipeline enforces.
+
+Because medium selection already belongs to stage placement, per-stage 
`StageRule` predicates **do not route traces to a medium**. Within whatever 
stage currently holds a trace, the predicates only act as a **retention gate** 
— dictating whether a trace is retained when a partition is rewritten or 
migrated, or discarded/pruned to reduce storage footprint.
+
+### 4.2 Stage-Stepped Retention via Rising Predicates
+
+"Aging" is expressed structurally: each `StageRule` declares its own 
predicates, and the bar rises as data moves to colder stages. **Semantics of a 
`StageRule` at one of its `apply_at` events:** the set predicates are 
OR-combined — a trace is retained if **any** set predicate matches it, and 
dropped if it fails all of them. A `StageRule` with no predicates configured at 
all has no filtering effect (every trace at that stage is retained); to 
actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, 
or `keep_tag_rules` must be set.
+
+A typical profile rises Hot → Warm → Cold by tightening predicates at each 
stage's `MIGRATION_OUT` event. For example:
+
+- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag 
(PostgreSQL, ActiveMQ, etc.).
+- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the 
most-important tag.
+- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces 
are dropped, only errors persist for the full 30-day retention.
+
+Predicates are direct boolean checks, so retention is deterministic and 
self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` 
matched, even though it was only 3 ms." Time enters only through *when* a 
partition reaches each stage, which is governed by `LifecycleStage.ttl` / 
`segment_interval`.
+
+## Downstream Lifecycle Actions Governed by Per-Stage Predicates
+
+The time-aging engine evaluates each trace against its current stage's 
`StageRule` predicates at the stage's configured lifecycle events, filtering 
data blocks as partitions are rewritten or migrated between lifecycle stages.
+
+### 6.1 Compaction Rewrites

Review Comment:
   Section numbering jumps from §4.2 to §6.1, which makes cross-references 
harder to follow and looks unintentional.
   



##########
docs/design/post-trace-pipeline.md:
##########
@@ -0,0 +1,493 @@
+# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping 
Specification
+
+This document presents the complete technical design for the **BanyanDB 
Storage-Node Post-Trace Pipeline** and its integration with physical hardware 
storage tiers and time-aging schedulers.
+
+Unlike traditional streaming-based telemetry engines that perform span 
assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a 
**native trace model**. Spans are stored, sorted, and indexed by `trace_id` 
directly within the storage engine layout on individual data nodes.
+
+By decoupling trace assembly from the active ingestion path, the post-trace 
pipeline executes asynchronous tail-sampling, per-stage retention filtering, 
and data reduction natively during storage lifecycle events on the data nodes. 
This design is grounded in recent academic advancements in post-hoc retroactive 
tracing and storage-level file merge analysis.
+
+## Architectural Concept: Decoupled Gating and Per-Stage Retention
+
+To maximize storage efficiency and compute performance on the BanyanDB data 
nodes, trace evaluation is separated into two logical phases, which 
subsequently feed into an automated **time-aging system** for partition-level 
migration:
+
+```Plain Text
++-----------------------------------+
+                  |      Grouped Trace Assembly       |
+                  +-----------------------------------+
+                                    |
+                                    v
++-----------------------------------------------------------------------+
+| 1. Tail-Sampling Gating (tail_sampling, at finalization)              |
+|    - Goal: Decide whether a trace is retained at all                  |
+|    - Criteria: sure-keep rules (duration / errors / tags) + sampling  |
++-----------------------------------------------------------------------+
+                                    |
+                  +-----------------+-----------------+
+                  | (Retain)                          | (Drop / Purge)
+                  v                                   v
++---------------------------------------------------+ +-----------------+
+| 2. Per-Stage Retention (StageRule predicates)     | |  Discard Block  |
+|    - Goal: At each stage's lifecycle event,       | |  Reclaim Space  |
+|      keep / drop per the stage's predicates       | +-----------------+
+|    - Criteria: min_duration / keep_errors /       |
+|                keep_tag_rules, per stage          |
++---------------------------------------------------+
+                  |
+                  v
++-----------------------------------------------------------------------+
+| 3. Time-Aging System (Stage-Stepped Migration Engine)                 |
+|    - Goal: Each stage's StageRule governs survival at its lifecycle   |
+|      events (compaction, finalization, migration_out)                 |
+|    - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order)      |
+|    - RULE: medium is a node-group concern (may be heterogeneous);     |
+|      predicates govern per-stage retention, not medium selection.     |
++-----------------------------------------------------------------------+
+```
+
+### 1.1 Gating (`tail_sampling`)
+
+- **Operation Type**: Tail-sampling filter (keep/drop decision at 
finalization). It combines absolute *sure-keep* rules with a hashed 
probabilistic sample of everything else; it is not a pure boolean classifier.
+
+- **Responsibility**: Initial filtering at the finalization pass (§8.3). It 
determines whether an assembled trace block **survives** at all — passing into 
the stage lifecycle — or is **purged** immediately to reclaim storage space on 
disk.
+
+- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules 
(duration boundaries, error presence, tag match) plus a single `hash(trace_id) 
< healthy_sample_rate` comparison. The probabilistic keep is a deterministic 
`hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / 
drop decision is stable only once the trace has stopped growing: a partial 
trace's duration is a lower bound, so any drop is deferred until the trace's 
last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured 
trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) 
then yields the same keep/drop result.
+
+### 1.2 Per-Stage Retention (StageRule predicates)
+
+- **Operation Type**: Per-stage keep/drop predicate evaluation. Each 
`StageRule` declares which trace properties (`min_duration`, `keep_errors`, 
`keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails 
every predicate is dropped at the stage's next configured lifecycle event.
+
+- **Responsibility**: Decides **which traces survive each stage** as data 
ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed 
by increasingly strict predicates at each stage's `MIGRATION_OUT` (or 
`COMPACTION`) event. The pipeline does **not** choose the physical storage 
medium; that is a node-group / `LifecycleStage` placement concern (§4.1).
+
+- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata 
(free, no decode); `has_error` requires a decoded read of span bodies only when 
`keep_errors` is set; tag matchers run against indexed tags. Predicates are 
direct boolean checks.
+
+## Protobuf Message Design
+
+The pipeline configuration is a single trace-typed message, 
`TracePipelineConfig`. Rather than a parallel abstract metadata layer, it 
reuses the existing catalog identifiers — Group (via `metadata`), lifecycle 
stage names, and schema names — and adds only the trace-specific tail-sampling 
and per-stage retention rules. General stream parsing and metrics aggregation 
configurations are excluded to maintain a strict focus on trace-centric 
analytical operations.
+
+### 2.1 Targeting Model: Reuse Existing Catalog Identifiers
+
+This design deliberately does **not** introduce an abstract `Pipeline` 
resource or an `ExecutionTrigger` enum. Both would re-declare targeting that 
the storage model already owns and let the two drift. A trace pipeline is 
instead expressed entirely with identifiers that already exist:
+
+- **Group** — named by the config's own `metadata.group` 
(`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that 
Group, exactly as every other schema resource does. The Group already fixes the 
`catalog`, so catalog is never repeated.
+- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage 
from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and 
carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, 
`keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary 
queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules 
are configured here, per stage and per pipeline — there are no hardcoded 
predicates in the engine.
+- **Schema selector** — an explicit `schema_names` list (exact match on 
`common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches 
if it is listed OR matches the regex; both empty targets every schema in the 
Group.
+
+The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced 
by **stage lifecycle events**, because each trigger was already anchored to a 
stage: compaction is an LSM merge within a stage (§8.1), migration is a segment 
leaving one stage for the next (§8.2), and the scheduled pass finalizes a 
settled time segment within a stage (§8.3). Each `StageRule` carries its own 
`apply_at` selecting which of those events run the filter; empty means all of 
them. This keeps the trace-specific rules out of the generic, catalog-agnostic 
`common.v1.LifecycleStage` (which stream and measure groups share) while still 
binding them to stages by name.
+
+Multiple `TracePipelineConfig`s may coexist in a Group only when their 
effective coverage is disjoint; the schema registry enforces this with a 
per-tuple uniqueness rule (§2.3).
+
+### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`)
+
+The full message definitions live in the proto source at 
`api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package 
`banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is 
the single root resource: it carries its own identity (`metadata`), the 
targeting fields from §2.1, and the trace-specific tail-sampling and per-stage 
retention rules. There is no embedded abstract `Pipeline` and no 
`ExecutionTrigger`; the lifecycle events that run the filter are named by 
`apply_at`.
+
+The message set is:
+
+- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the 
per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, 
the `tail_sampling` (gating) block, and the completeness windows `merge_grace` 
(§8.1) and `finalize_grace` (§8.3).
+- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the 
per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, 
`min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are 
OR-combined: a trace survives if any one matches; it is dropped if it fails all 
set predicates. A rule with no predicates set at all has no filtering effect 
(see §4.2).
+- **`StageEvent`** — enum of the lifecycle events a rule may run at: 
`STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and 
`STAGE_EVENT_MIGRATION_OUT` (§8.2).
+- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: 
`duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` 
(each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`).
+- **`TagMatcher`** — sure-keep tag predicate used by 
`StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries 
no `sample_rate` — a match is an absolute keep.
+
+### 2.3 Uniqueness and Conflict Policy
+
+Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most 
one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` 
tuple**. The schema registry enforces this on write so the filter behavior at 
any tuple is deterministic — there is no implicit ordering, priority, or 
composition of overlapping pipelines.
+
+**Effective coverage of a pipeline.** A `TracePipelineConfig` P with 
`enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when:
+
+- `schema` is a schema in `P.metadata.group` whose name is listed in 
`P.schema_names` or matches `P.schema_name_regex` (both empty selectors target 
every schema in the Group);
+- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every 
stage in the Group's `ResourceOpts`);
+- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers 
every `StageEvent`).
+
+**Conflict detection on write.** When a `TracePipelineConfig` is created or 
updated, the registry expands its effective coverage against the current 
schemas and stages of its Group and rejects the write if **any** covered tuple 
is already covered by another `enabled` pipeline in the same Group. The error 
names the conflicting pipeline and the overlapping tuple(s).
+
+**Schema additions.** When a new `Trace` schema is created in a Group, the 
registry re-validates every `enabled` `TracePipelineConfig` in that Group 
against the new schema. If the new schema would cause two pipelines to overlap 
on any `(stage, event)`, the schema creation is rejected; the operator must 
narrow the pipelines (e.g. add explicit `schema_names`) before adding the 
schema.
+
+**Disabling and replacement.** Setting `enabled=false` removes a pipeline from 
active coverage immediately, freeing its tuples for another pipeline to claim. 
This is the safe path for atomically swapping in a new pipeline: disable the 
old, enable or create the new, then delete the old. There is no implicit 
precedence between two `enabled` pipelines — the registry will not pick a 
winner.
+
+**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and 
migration (§8.2) the engine selects the active pipeline by `(group, schema, 
stage, event)`. If a registry inconsistency ever exposes more than one active 
match for a tuple, the engine logs an error and applies **none** of them — 
failing safe to retain rather than risk a nondeterministic destructive drop.
+
+**Recommended pattern.** For most workloads, a single pipeline per Group with 
a broad `schema_name_regex` is the simplest configuration. Multiple pipelines 
in a Group are useful only when their schema selectors are mutually exclusive 
(for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in 
a hypothetical mixed group) so no tuple is doubly covered.
+
+### 2.4 Admission Validation
+
+The proto pins bounds with PGV so a malformed config is rejected at parse time 
rather than producing undefined behavior. The schema registry's admission 
control adds one cross-field rule that PGV cannot express.
+
+**PGV-enforced bounds** (rejected at proto parse / `Write` time):
+
+- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 
0s`).
+- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in 
`[0.0, 1.0]`.
+- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset 
means duration is not a retention factor for that stage.
+- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — 
strictly positive when set; unset means engine default (§8.1 / §8.3).
+
+**Admission rule (server-side; not expressible in PGV):**
+
+- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = 
true`, the config must declare at least one of: a non-empty `tail_sampling`, or 
at least one `StageRule` with at least one retention predicate (`min_duration`, 
`keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with 
no gating and no per-stage predicates has no effect on traces and is rejected 
as a configuration error. To temporarily disable a pipeline, set `enabled = 
false` — do not strip its rule blocks.
+
+Both the registry write path and the runtime apply these checks. If a registry 
inconsistency ever lets a malformed config reach the engine, the engine logs an 
error and falls back to **retain** for every tuple the malformed config would 
have governed — never a destructive drop (consistent with the §2.3 runtime 
fail-safe).
+
+## Retention-Decision Timing & Trace Completeness
+
+Both the tail-sampling gate (§1.1) and the per-stage retention predicates 
(§1.2) operate on the **whole** trace, not on individual spans as they arrive: 
`D_total` needs the earliest start and latest end, `has_error` scans all spans, 
and tag matchers scan all spans. Spans of one trace, however, arrive at the 
storage node anywhere from milliseconds to hours apart, and BanyanDB writes 
each span into the segment selected by its **event time** (the `start_time` 
tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no 
live partial-trace buffer; a trace exists only as spans co-located by 
`trace_id` inside a segment's parts.
+
+### 3.1 When Retention Decisions Are Safe
+
+Predicates are not evaluated for destructive drops at write time. They run 
lazily, by the post-trace passes that re-assemble a trace by `trace_id`:
+
+- **Provisional** — during an LSM compaction merge (§8.1). This sees only the 
spans that have landed so far, so a drop verdict could be premature; the 
per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace 
has stopped growing.
+- **Authoritative** — at the post-trace scheduler's finalization pass once the 
trace's segment has **settled**: its event-time window has closed *and* the 
event-time watermark has advanced past it by the `finalize_grace` period 
(§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates 
the next segment up to an hour *before* the current window ends, and late data 
keeps routing to an old segment by event time — so "settled" is a watermark 
heuristic, not a guarantee.
+- **At migration_out** — by the time a segment is migrated to the next stage 
(≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every 
trace it holds has been settled, so predicate evaluation is final.
+
+Two completeness caveats follow from event-time segmentation:
+
+1. **Spans arriving after finalization.** A span whose `start_time` falls in 
an already-finalized segment is still accepted (while the segment is within 
retention) but is missed by the finalization-time evaluation. The 
`finalize_grace` period bounds how often this happens; a late span that lands 
after it is only re-evaluated if a subsequent compaction rewrites that part.
+
+2. **Traces spanning multiple segments.** A trace longer than 
`segment_interval`, or one that straddles a boundary, is physically split 
across two segments, and each segment's pass evaluates only its own fragment 
(`D_total` and the indicators are fragment-local). BanyanDB performs no 
cross-segment trace assembly at the storage layer. Where trace duration is far 
below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day 
segments) the fragment equals the whole trace and this is a non-issue; for 
long-running traces it is a known limitation.
+
+## Integration of the Retention System & Time-Aging System
+
+Distributed telemetry's diagnostic utility naturally decays as time passes. 
The pipeline must therefore decide, per stage, what to keep — but it must 
**not** try to pick the physical storage medium, because in BanyanDB the medium 
is already determined by where a `LifecycleStage` is placed.
+
+### 4.1 Stage Placement vs. Rule-Governed Retention
+
+The physical medium is **not** fixed per tier by this design. Each 
`LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through 
its `node_selector`; the medium is whatever hardware that node group runs, and 
operators may deploy heterogeneous backings (for example two warm node groups 
on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA 
SSD/HDD, Cold → object storage, but that mapping is an operator deployment 
choice, not a rule this pipeline enforces.
+
+Because medium selection already belongs to stage placement, per-stage 
`StageRule` predicates **do not route traces to a medium**. Within whatever 
stage currently holds a trace, the predicates only act as a **retention gate** 
— dictating whether a trace is retained when a partition is rewritten or 
migrated, or discarded/pruned to reduce storage footprint.
+
+### 4.2 Stage-Stepped Retention via Rising Predicates
+
+"Aging" is expressed structurally: each `StageRule` declares its own 
predicates, and the bar rises as data moves to colder stages. **Semantics of a 
`StageRule` at one of its `apply_at` events:** the set predicates are 
OR-combined — a trace is retained if **any** set predicate matches it, and 
dropped if it fails all of them. A `StageRule` with no predicates configured at 
all has no filtering effect (every trace at that stage is retained); to 
actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, 
or `keep_tag_rules` must be set.
+
+A typical profile rises Hot → Warm → Cold by tightening predicates at each 
stage's `MIGRATION_OUT` event. For example:
+
+- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag 
(PostgreSQL, ActiveMQ, etc.).
+- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the 
most-important tag.
+- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces 
are dropped, only errors persist for the full 30-day retention.
+
+Predicates are direct boolean checks, so retention is deterministic and 
self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` 
matched, even though it was only 3 ms." Time enters only through *when* a 
partition reaches each stage, which is governed by `LifecycleStage.ttl` / 
`segment_interval`.
+
+## Downstream Lifecycle Actions Governed by Per-Stage Predicates
+
+The time-aging engine evaluates each trace against its current stage's 
`StageRule` predicates at the stage's configured lifecycle events, filtering 
data blocks as partitions are rewritten or migrated between lifecycle stages.
+
+### 6.1 Compaction Rewrites
+
+During normal LSM merge operations inside a stage, the data node compacts 
older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its 
`apply_at`, the merge filter (§8.1) evaluates each mature trace against the 
stage's predicates and omits non-matching traces from the consolidated output. 
Compaction is then both a space-reclamation pass and a per-stage retention 
pass; pipelines that target only migration boundaries leave the routine LSM 
compaction byte-for-byte lossless.
+
+### 6.2 Partition-Level Tier Migration
+
+Both showcase trace groups (`sw_trace`, `sw_zipkinTrace`) declare a Hot → Warm 
→ Cold lifecycle in `ResourceOpts.stages`: Hot holds the active window (1-day 
ttl), Warm runs on `node_selector type=warm` (7-day ttl), and Cold runs on 
`node_selector type=cold` with `close=true` (30-day ttl). When a Hot segment 
matures past its 1-day ttl, the migration worker copies it to the Warm node 
group; the medium behind each node group (e.g. NVMe → SATA SSD → object 
storage) is the operator's deployment choice (§4.1), not something this 
pipeline picks.
+
+- The migration engine evaluates the source stage's `MIGRATION_OUT` predicates 
against every trace in the partition.
+
+- **No dynamic splitting is performed:** all retained data is written to the 
next stage's node group. Traces that fail the source stage's `MIGRATION_OUT` 
predicates are omitted from the target write stream, reducing the physical size 
of the migrated partition. With Scenario 7.1's Hot predicates a healthy 
`/homepage` trace (2802 ms) matches `min_duration: 100ms` and migrates to Warm; 
a PostgreSQL-touching trace matches `keep_tag_rules` and migrates too; a 
healthy fast trace (6 ms) matches none and is dropped.
+
+- When the partition matures past the Warm 7-day ttl, the Warm stage's 
`MIGRATION_OUT` predicates apply (typically stricter — e.g. `min_duration: 
500ms`); only matching traces are written into the Cold parts. Healthy baseline 
traces, even ones that survived Warm, are dropped here; error traces (with 
`keep_errors: true`) persist all the way to Cold.
+
+### 6.3 Eviction: Per-Trace Drop During Rewrites, Segment-Granularity GC

Review Comment:
   Section numbering jumps from §4.2 to §6.3; renumbering to §5.3 would keep 
the document's subsection numbering contiguous.
   



##########
docs/design/post-trace-pipeline.md:
##########
@@ -0,0 +1,493 @@
+# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping 
Specification
+
+This document presents the complete technical design for the **BanyanDB 
Storage-Node Post-Trace Pipeline** and its integration with physical hardware 
storage tiers and time-aging schedulers.
+
+Unlike traditional streaming-based telemetry engines that perform span 
assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a 
**native trace model**. Spans are stored, sorted, and indexed by `trace_id` 
directly within the storage engine layout on individual data nodes.
+
+By decoupling trace assembly from the active ingestion path, the post-trace 
pipeline executes asynchronous tail-sampling, per-stage retention filtering, 
and data reduction natively during storage lifecycle events on the data nodes. 
This design is grounded in recent academic advancements in post-hoc retroactive 
tracing and storage-level file merge analysis.
+
+## Architectural Concept: Decoupled Gating and Per-Stage Retention
+
+To maximize storage efficiency and compute performance on the BanyanDB data 
nodes, trace evaluation is separated into two logical phases, which 
subsequently feed into an automated **time-aging system** for partition-level 
migration:
+
+```Plain Text
++-----------------------------------+
+                  |      Grouped Trace Assembly       |
+                  +-----------------------------------+
+                                    |
+                                    v
++-----------------------------------------------------------------------+
+| 1. Tail-Sampling Gating (tail_sampling, at finalization)              |
+|    - Goal: Decide whether a trace is retained at all                  |
+|    - Criteria: sure-keep rules (duration / errors / tags) + sampling  |
++-----------------------------------------------------------------------+
+                                    |
+                  +-----------------+-----------------+
+                  | (Retain)                          | (Drop / Purge)
+                  v                                   v
++---------------------------------------------------+ +-----------------+
+| 2. Per-Stage Retention (StageRule predicates)     | |  Discard Block  |
+|    - Goal: At each stage's lifecycle event,       | |  Reclaim Space  |
+|      keep / drop per the stage's predicates       | +-----------------+
+|    - Criteria: min_duration / keep_errors /       |
+|                keep_tag_rules, per stage          |
++---------------------------------------------------+
+                  |
+                  v
++-----------------------------------------------------------------------+
+| 3. Time-Aging System (Stage-Stepped Migration Engine)                 |
+|    - Goal: Each stage's StageRule governs survival at its lifecycle   |
+|      events (compaction, finalization, migration_out)                 |
+|    - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order)      |
+|    - RULE: medium is a node-group concern (may be heterogeneous);     |
+|      predicates govern per-stage retention, not medium selection.     |
++-----------------------------------------------------------------------+
+```
+
+### 1.1 Gating (`tail_sampling`)
+
+- **Operation Type**: Tail-sampling filter (keep/drop decision at 
finalization). It combines absolute *sure-keep* rules with a hashed 
probabilistic sample of everything else; it is not a pure boolean classifier.
+
+- **Responsibility**: Initial filtering at the finalization pass (§8.3). It 
determines whether an assembled trace block **survives** at all — passing into 
the stage lifecycle — or is **purged** immediately to reclaim storage space on 
disk.
+
+- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules 
(duration boundaries, error presence, tag match) plus a single `hash(trace_id) 
< healthy_sample_rate` comparison. The probabilistic keep is a deterministic 
`hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / 
drop decision is stable only once the trace has stopped growing: a partial 
trace's duration is a lower bound, so any drop is deferred until the trace's 
last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured 
trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) 
then yields the same keep/drop result.
+
+### 1.2 Per-Stage Retention (StageRule predicates)
+
+- **Operation Type**: Per-stage keep/drop predicate evaluation. Each 
`StageRule` declares which trace properties (`min_duration`, `keep_errors`, 
`keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails 
every predicate is dropped at the stage's next configured lifecycle event.
+
+- **Responsibility**: Decides **which traces survive each stage** as data 
ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed 
by increasingly strict predicates at each stage's `MIGRATION_OUT` (or 
`COMPACTION`) event. The pipeline does **not** choose the physical storage 
medium; that is a node-group / `LifecycleStage` placement concern (§4.1).
+
+- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata 
(free, no decode); `has_error` requires a decoded read of span bodies only when 
`keep_errors` is set; tag matchers run against indexed tags. Predicates are 
direct boolean checks.
+
+## Protobuf Message Design
+
+The pipeline configuration is a single trace-typed message, 
`TracePipelineConfig`. Rather than a parallel abstract metadata layer, it 
reuses the existing catalog identifiers — Group (via `metadata`), lifecycle 
stage names, and schema names — and adds only the trace-specific tail-sampling 
and per-stage retention rules. General stream parsing and metrics aggregation 
configurations are excluded to maintain a strict focus on trace-centric 
analytical operations.
+
+### 2.1 Targeting Model: Reuse Existing Catalog Identifiers
+
+This design deliberately does **not** introduce an abstract `Pipeline` 
resource or an `ExecutionTrigger` enum. Both would re-declare targeting that 
the storage model already owns and let the two drift. A trace pipeline is 
instead expressed entirely with identifiers that already exist:
+
+- **Group** — named by the config's own `metadata.group` 
(`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that 
Group, exactly as every other schema resource does. The Group already fixes the 
`catalog`, so catalog is never repeated.
+- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage 
from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and 
carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, 
`keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary 
queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules 
are configured here, per stage and per pipeline — there are no hardcoded 
predicates in the engine.
+- **Schema selector** — an explicit `schema_names` list (exact match on 
`common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches 
if it is listed OR matches the regex; both empty targets every schema in the 
Group.
+
+The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced 
by **stage lifecycle events**, because each trigger was already anchored to a 
stage: compaction is an LSM merge within a stage (§8.1), migration is a segment 
leaving one stage for the next (§8.2), and the scheduled pass finalizes a 
settled time segment within a stage (§8.3). Each `StageRule` carries its own 
`apply_at` selecting which of those events run the filter; empty means all of 
them. This keeps the trace-specific rules out of the generic, catalog-agnostic 
`common.v1.LifecycleStage` (which stream and measure groups share) while still 
binding them to stages by name.
+
+Multiple `TracePipelineConfig`s may coexist in a Group only when their 
effective coverage is disjoint; the schema registry enforces this with a 
per-tuple uniqueness rule (§2.3).
+
+### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`)
+
+The full message definitions live in the proto source at 
`api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package 
`banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is 
the single root resource: it carries its own identity (`metadata`), the 
targeting fields from §2.1, and the trace-specific tail-sampling and per-stage 
retention rules. There is no embedded abstract `Pipeline` and no 
`ExecutionTrigger`; the lifecycle events that run the filter are named by 
`apply_at`.
+
+The message set is:
+
+- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the 
per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, 
the `tail_sampling` (gating) block, and the completeness windows `merge_grace` 
(§8.1) and `finalize_grace` (§8.3).
+- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the 
per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, 
`min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are 
OR-combined: a trace survives if any one matches; it is dropped if it fails all 
set predicates. A rule with no predicates set at all has no filtering effect 
(see §4.2).
+- **`StageEvent`** — enum of the lifecycle events a rule may run at: 
`STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and 
`STAGE_EVENT_MIGRATION_OUT` (§8.2).
+- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: 
`duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` 
(each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`).
+- **`TagMatcher`** — sure-keep tag predicate used by 
`StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries 
no `sample_rate` — a match is an absolute keep.
+
+### 2.3 Uniqueness and Conflict Policy
+
+Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most 
one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` 
tuple**. The schema registry enforces this on write so the filter behavior at 
any tuple is deterministic — there is no implicit ordering, priority, or 
composition of overlapping pipelines.
+
+**Effective coverage of a pipeline.** A `TracePipelineConfig` P with 
`enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when:
+
+- `schema` is a schema in `P.metadata.group` whose name is listed in 
`P.schema_names` or matches `P.schema_name_regex` (both empty selectors target 
every schema in the Group);
+- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every 
stage in the Group's `ResourceOpts`);
+- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers 
every `StageEvent`).
+
+**Conflict detection on write.** When a `TracePipelineConfig` is created or 
updated, the registry expands its effective coverage against the current 
schemas and stages of its Group and rejects the write if **any** covered tuple 
is already covered by another `enabled` pipeline in the same Group. The error 
names the conflicting pipeline and the overlapping tuple(s).
+
+**Schema additions.** When a new `Trace` schema is created in a Group, the 
registry re-validates every `enabled` `TracePipelineConfig` in that Group 
against the new schema. If the new schema would cause two pipelines to overlap 
on any `(stage, event)`, the schema creation is rejected; the operator must 
narrow the pipelines (e.g. add explicit `schema_names`) before adding the 
schema.
+
+**Disabling and replacement.** Setting `enabled=false` removes a pipeline from 
active coverage immediately, freeing its tuples for another pipeline to claim. 
This is the safe path for atomically swapping in a new pipeline: disable the 
old, enable or create the new, then delete the old. There is no implicit 
precedence between two `enabled` pipelines — the registry will not pick a 
winner.
+
+**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and 
migration (§8.2) the engine selects the active pipeline by `(group, schema, 
stage, event)`. If a registry inconsistency ever exposes more than one active 
match for a tuple, the engine logs an error and applies **none** of them — 
failing safe to retain rather than risk a nondeterministic destructive drop.
+
+**Recommended pattern.** For most workloads, a single pipeline per Group with 
a broad `schema_name_regex` is the simplest configuration. Multiple pipelines 
in a Group are useful only when their schema selectors are mutually exclusive 
(for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in 
a hypothetical mixed group) so no tuple is doubly covered.
+
+### 2.4 Admission Validation
+
+The proto pins bounds with PGV so a malformed config is rejected at parse time 
rather than producing undefined behavior. The schema registry's admission 
control adds one cross-field rule that PGV cannot express.
+
+**PGV-enforced bounds** (rejected at proto parse / `Write` time):
+
+- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 
0s`).
+- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in 
`[0.0, 1.0]`.
+- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset 
means duration is not a retention factor for that stage.
+- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — 
strictly positive when set; unset means engine default (§8.1 / §8.3).
+
+**Admission rule (server-side; not expressible in PGV):**
+
+- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = 
true`, the config must declare at least one of: a non-empty `tail_sampling`, or 
at least one `StageRule` with at least one retention predicate (`min_duration`, 
`keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with 
no gating and no per-stage predicates has no effect on traces and is rejected 
as a configuration error. To temporarily disable a pipeline, set `enabled = 
false` — do not strip its rule blocks.
+
+Both the registry write path and the runtime apply these checks. If a registry 
inconsistency ever lets a malformed config reach the engine, the engine logs an 
error and falls back to **retain** for every tuple the malformed config would 
have governed — never a destructive drop (consistent with the §2.3 runtime 
fail-safe).
+
+## Retention-Decision Timing & Trace Completeness
+
+Both the tail-sampling gate (§1.1) and the per-stage retention predicates 
(§1.2) operate on the **whole** trace, not on individual spans as they arrive: 
`D_total` needs the earliest start and latest end, `has_error` scans all spans, 
and tag matchers scan all spans. Spans of one trace, however, arrive at the 
storage node anywhere from milliseconds to hours apart, and BanyanDB writes 
each span into the segment selected by its **event time** (the `start_time` 
tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no 
live partial-trace buffer; a trace exists only as spans co-located by 
`trace_id` inside a segment's parts.
+
+### 3.1 When Retention Decisions Are Safe
+
+Predicates are not evaluated for destructive drops at write time. They run 
lazily, by the post-trace passes that re-assemble a trace by `trace_id`:
+
+- **Provisional** — during an LSM compaction merge (§8.1). This sees only the 
spans that have landed so far, so a drop verdict could be premature; the 
per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace 
has stopped growing.
+- **Authoritative** — at the post-trace scheduler's finalization pass once the 
trace's segment has **settled**: its event-time window has closed *and* the 
event-time watermark has advanced past it by the `finalize_grace` period 
(§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates 
the next segment up to an hour *before* the current window ends, and late data 
keeps routing to an old segment by event time — so "settled" is a watermark 
heuristic, not a guarantee.
+- **At migration_out** — by the time a segment is migrated to the next stage 
(≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every 
trace it holds has been settled, so predicate evaluation is final.
+
+Two completeness caveats follow from event-time segmentation:
+
+1. **Spans arriving after finalization.** A span whose `start_time` falls in 
an already-finalized segment is still accepted (while the segment is within 
retention) but is missed by the finalization-time evaluation. The 
`finalize_grace` period bounds how often this happens; a late span that lands 
after it is only re-evaluated if a subsequent compaction rewrites that part.
+
+2. **Traces spanning multiple segments.** A trace longer than 
`segment_interval`, or one that straddles a boundary, is physically split 
across two segments, and each segment's pass evaluates only its own fragment 
(`D_total` and the indicators are fragment-local). BanyanDB performs no 
cross-segment trace assembly at the storage layer. Where trace duration is far 
below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day 
segments) the fragment equals the whole trace and this is a non-issue; for 
long-running traces it is a known limitation.
+
+## Integration of the Retention System & Time-Aging System
+
+Distributed telemetry's diagnostic utility naturally decays as time passes. 
The pipeline must therefore decide, per stage, what to keep — but it must 
**not** try to pick the physical storage medium, because in BanyanDB the medium 
is already determined by where a `LifecycleStage` is placed.
+
+### 4.1 Stage Placement vs. Rule-Governed Retention
+
+The physical medium is **not** fixed per tier by this design. Each 
`LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through 
its `node_selector`; the medium is whatever hardware that node group runs, and 
operators may deploy heterogeneous backings (for example two warm node groups 
on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA 
SSD/HDD, Cold → object storage, but that mapping is an operator deployment 
choice, not a rule this pipeline enforces.
+
+Because medium selection already belongs to stage placement, per-stage 
`StageRule` predicates **do not route traces to a medium**. Within whatever 
stage currently holds a trace, the predicates only act as a **retention gate** 
— dictating whether a trace is retained when a partition is rewritten or 
migrated, or discarded/pruned to reduce storage footprint.
+
+### 4.2 Stage-Stepped Retention via Rising Predicates
+
+"Aging" is expressed structurally: each `StageRule` declares its own 
predicates, and the bar rises as data moves to colder stages. **Semantics of a 
`StageRule` at one of its `apply_at` events:** the set predicates are 
OR-combined — a trace is retained if **any** set predicate matches it, and 
dropped if it fails all of them. A `StageRule` with no predicates configured at 
all has no filtering effect (every trace at that stage is retained); to 
actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, 
or `keep_tag_rules` must be set.
+
+A typical profile rises Hot → Warm → Cold by tightening predicates at each 
stage's `MIGRATION_OUT` event. For example:
+
+- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag 
(PostgreSQL, ActiveMQ, etc.).
+- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the 
most-important tag.
+- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces 
are dropped, only errors persist for the full 30-day retention.
+
+Predicates are direct boolean checks, so retention is deterministic and 
self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` 
matched, even though it was only 3 ms." Time enters only through *when* a 
partition reaches each stage, which is governed by `LifecycleStage.ttl` / 
`segment_interval`.
+
+## Downstream Lifecycle Actions Governed by Per-Stage Predicates
+
+The time-aging engine evaluates each trace against its current stage's 
`StageRule` predicates at the stage's configured lifecycle events, filtering 
data blocks as partitions are rewritten or migrated between lifecycle stages.
+
+### 6.1 Compaction Rewrites
+
+During normal LSM merge operations inside a stage, the data node compacts 
older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its 
`apply_at`, the merge filter (§8.1) evaluates each mature trace against the 
stage's predicates and omits non-matching traces from the consolidated output. 
Compaction is then both a space-reclamation pass and a per-stage retention 
pass; pipelines that target only migration boundaries leave the routine LSM 
compaction byte-for-byte lossless.
+
+### 6.2 Partition-Level Tier Migration

Review Comment:
   Section numbering jumps from §4.2 to §6.2; if §6 is meant to be the next 
major section after §4, consider renumbering these subsections to keep the 
sequence consistent.
   



##########
api/proto/banyandb/pipeline/v1/trace_pipeline.proto:
##########
@@ -0,0 +1,187 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto3";
+
+package banyandb.pipeline.v1;
+
+import "banyandb/common/v1/common.proto";
+import "google/protobuf/duration.proto";
+import "validate/validate.proto";
+
+option go_package = 
"github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1";
+option java_package = "org.apache.skywalking.banyandb.pipeline.v1";
+
+// StageEvent enumerates the lifecycle events of a stage at which the filter 
runs.
+enum StageEvent {
+  STAGE_EVENT_UNSPECIFIED = 0;
+  // In-line filter during an LSM merge within a stage.
+  STAGE_EVENT_COMPACTION = 1;
+  // Scheduled final filter once a segment has settled: its event-time window
+  // has closed and the watermark has advanced past it by a grace period.
+  // (BanyanDB has no hard "seal" event; settling is watermark-driven.)
+  STAGE_EVENT_FINALIZE = 2;
+  // Pre-migration rewrite when a segment leaves this stage for the next.
+  STAGE_EVENT_MIGRATION_OUT = 3;
+}
+
+// TracePipelineConfig is the root configuration for a storage-node trace 
pipeline.
+// It reuses existing catalog identifiers (group via metadata, stage names, 
schema
+// names) for targeting instead of declaring a parallel metadata model.
+message TracePipelineConfig {
+  // Identity and revision tracking; metadata.group is the Group this pipeline
+  // lives in and applies to, consistent with every other schema resource.
+  common.v1.Metadata metadata = 1;
+  // Active status of the pipeline.
+  bool enabled = 2;
+  // Per-stage retention rules: which lifecycle stages this pipeline acts on,
+  // with the keep-predicates and lifecycle events for each. Empty means the
+  // tail-sampling gate at finalization is the only filter; no per-stage drop
+  // is applied.
+  repeated StageRule stages = 3;
+  // Explicit schema names to target within the Group (exact match on 
Metadata.name).
+  repeated string schema_names = 4;
+  // RE2 regular expression matched against schema names. A schema is targeted 
if it
+  // is listed in schema_names OR matches this pattern. When both are empty, 
every
+  // schema in the Group is targeted.
+  string schema_name_regex = 5;
+  // Tail-sampling (gating) rules applied at finalization (§8.3) to decide
+  // whether a trace is retained at all once its segment has settled.
+  TailSampling tail_sampling = 6;
+  // Per-trace maturity window for the in-line merge filter (§8.1). A trace is
+  // eligible for dropping during an LSM compaction merge only once its latest
+  // span timestamp is older than `now - merge_grace`; younger traces pass
+  // through the merge unchanged. Bounds the expected intra-trace span arrival
+  // spread (typically seconds). Must be strictly positive if set; if unset, 
the
+  // engine uses a documented default (currently 30s).
+  google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+  // Per-segment settling window for the scheduled finalization pass (§8.3). A
+  // segment is treated as settled, and the authoritative final filter runs,
+  // once the event-time watermark exceeds `segment.End + finalize_grace`.
+  // Bounds segment-wide late arrival (typically minutes). Must be strictly
+  // positive if set; if unset, the engine uses a documented default
+  // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace,
+  // `finalize_grace` is per-segment.
+  google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+}
+
+// StageRule binds the pipeline to one lifecycle stage of the targeted Group
+// and declares per-stage retention predicates.
+//
+// Semantics (when the rule fires at one of its `apply_at` events):
+//   - If every predicate field is unset (no `min_duration`, no `keep_errors`,
+//     and an empty `keep_tag_rules`) the rule has no filtering effect — every
+//     trace at this stage is retained, and the tail-sampling gate at
+//     finalization remains the only filter affecting traces here.
+//   - Otherwise the *set* predicates are combined with OR: a trace is retained
+//     if any single set predicate matches it; a trace that fails all set
+//     predicates is dropped from this stage at the firing event.
+//
+// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise
+// OR-combined (any matcher match satisfies the keep_tag_rules predicate).
+message StageRule {
+  // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", 
"cold").
+  string stage = 1;
+  // Which of this stage's lifecycle events run the filter. Empty means all of
+  // them (compaction merges, segment finalization, and outbound migration).
+  repeated StageEvent apply_at = 2;
+  // Minimum total trace duration to retain at this stage. A trace whose
+  // duration is at least min_duration is retained. Strictly positive if set;
+  // if unset, duration is not a retention factor for this stage.
+  google.protobuf.Duration min_duration = 3 [(validate.rules).duration = {
+    gt: {seconds: 0}
+  }];
+  // If true, traces with any error span are retained at this stage regardless
+  // of duration or tag predicates.
+  bool keep_errors = 4;
+  // Sure-keep tag matchers: a trace satisfies this predicate when ANY of the
+  // matchers matches any of its spans (matchers are OR-combined). Distinct
+  // from TailSampling.tag_rules (which is for ingest-time gating with a
+  // sample_rate); TagMatcher has no sample_rate — a match is an absolute keep.
+  repeated TagMatcher keep_tag_rules = 5;
+}
+
+// TailSampling manages tail-sampling (ingest-gating) decisions at 
finalization.
+message TailSampling {
+  // Absolute trace duration threshold. Traces whose total latency meets or
+  // exceeds this value are preserved (a sure-keep at gating). Required and
+  // strictly positive.
+  google.protobuf.Duration duration_threshold = 1 [(validate.rules).duration = 
{
+    required: true
+    gt: {seconds: 0}
+  }];
+  // Immediate retention rule when any span contains an error state.
+  bool keep_all_errors = 2;
+  // Probabilistic keep rate for healthy, low-latency traces that match no
+  // sure-keep rule. The decision is a deterministic hash(trace_id) < rate, so
+  // it is stable across re-evaluation. Must be in [0.0, 1.0].
+  double healthy_sample_rate = 3 [(validate.rules).double = {
+    gte: 0.0
+    lte: 1.0
+  }];
+  // Conditions based on tags for immediate sampling decisions.
+  repeated TagSamplingRule tag_rules = 4;
+}
+
+// TagSamplingRule keeps traces whose tag value satisfies a matcher, with an
+// optional probabilistic sample rate (used by the tail-sampling gate).
+message TagSamplingRule {
+  string tag_key = 1;
+  // Matching condition against the tag's value.

Review Comment:
   `TagSamplingRule.tag_key` should be non-empty; otherwise a rule can be 
accepted but never match any real tag. The codebase typically validates 
identifier strings with `string.min_len = 1`.
   



##########
docs/design/post-trace-pipeline.md:
##########
@@ -0,0 +1,493 @@
+# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping 
Specification
+
+This document presents the complete technical design for the **BanyanDB 
Storage-Node Post-Trace Pipeline** and its integration with physical hardware 
storage tiers and time-aging schedulers.
+
+Unlike traditional streaming-based telemetry engines that perform span 
assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a 
**native trace model**. Spans are stored, sorted, and indexed by `trace_id` 
directly within the storage engine layout on individual data nodes.
+
+By decoupling trace assembly from the active ingestion path, the post-trace 
pipeline executes asynchronous tail-sampling, per-stage retention filtering, 
and data reduction natively during storage lifecycle events on the data nodes. 
This design is grounded in recent academic advancements in post-hoc retroactive 
tracing and storage-level file merge analysis.
+
+## Architectural Concept: Decoupled Gating and Per-Stage Retention
+
+To maximize storage efficiency and compute performance on the BanyanDB data 
nodes, trace evaluation is separated into two logical phases, which 
subsequently feed into an automated **time-aging system** for partition-level 
migration:
+
+```Plain Text
++-----------------------------------+
+                  |      Grouped Trace Assembly       |
+                  +-----------------------------------+
+                                    |
+                                    v
++-----------------------------------------------------------------------+
+| 1. Tail-Sampling Gating (tail_sampling, at finalization)              |
+|    - Goal: Decide whether a trace is retained at all                  |
+|    - Criteria: sure-keep rules (duration / errors / tags) + sampling  |
++-----------------------------------------------------------------------+
+                                    |
+                  +-----------------+-----------------+
+                  | (Retain)                          | (Drop / Purge)
+                  v                                   v
++---------------------------------------------------+ +-----------------+
+| 2. Per-Stage Retention (StageRule predicates)     | |  Discard Block  |
+|    - Goal: At each stage's lifecycle event,       | |  Reclaim Space  |
+|      keep / drop per the stage's predicates       | +-----------------+
+|    - Criteria: min_duration / keep_errors /       |
+|                keep_tag_rules, per stage          |
++---------------------------------------------------+
+                  |
+                  v
++-----------------------------------------------------------------------+
+| 3. Time-Aging System (Stage-Stepped Migration Engine)                 |
+|    - Goal: Each stage's StageRule governs survival at its lifecycle   |
+|      events (compaction, finalization, migration_out)                 |
+|    - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order)      |
+|    - RULE: medium is a node-group concern (may be heterogeneous);     |
+|      predicates govern per-stage retention, not medium selection.     |
++-----------------------------------------------------------------------+
+```
+
+### 1.1 Gating (`tail_sampling`)
+
+- **Operation Type**: Tail-sampling filter (keep/drop decision at 
finalization). It combines absolute *sure-keep* rules with a hashed 
probabilistic sample of everything else; it is not a pure boolean classifier.
+
+- **Responsibility**: Initial filtering at the finalization pass (§8.3). It 
determines whether an assembled trace block **survives** at all — passing into 
the stage lifecycle — or is **purged** immediately to reclaim storage space on 
disk.
+
+- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules 
(duration boundaries, error presence, tag match) plus a single `hash(trace_id) 
< healthy_sample_rate` comparison. The probabilistic keep is a deterministic 
`hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / 
drop decision is stable only once the trace has stopped growing: a partial 
trace's duration is a lower bound, so any drop is deferred until the trace's 
last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured 
trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) 
then yields the same keep/drop result.
+
+### 1.2 Per-Stage Retention (StageRule predicates)
+
+- **Operation Type**: Per-stage keep/drop predicate evaluation. Each 
`StageRule` declares which trace properties (`min_duration`, `keep_errors`, 
`keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails 
every predicate is dropped at the stage's next configured lifecycle event.
+
+- **Responsibility**: Decides **which traces survive each stage** as data 
ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed 
by increasingly strict predicates at each stage's `MIGRATION_OUT` (or 
`COMPACTION`) event. The pipeline does **not** choose the physical storage 
medium; that is a node-group / `LifecycleStage` placement concern (§4.1).
+
+- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata 
(free, no decode); `has_error` requires a decoded read of span bodies only when 
`keep_errors` is set; tag matchers run against indexed tags. Predicates are 
direct boolean checks.
+
+## Protobuf Message Design
+
+The pipeline configuration is a single trace-typed message, 
`TracePipelineConfig`. Rather than a parallel abstract metadata layer, it 
reuses the existing catalog identifiers — Group (via `metadata`), lifecycle 
stage names, and schema names — and adds only the trace-specific tail-sampling 
and per-stage retention rules. General stream parsing and metrics aggregation 
configurations are excluded to maintain a strict focus on trace-centric 
analytical operations.
+
+### 2.1 Targeting Model: Reuse Existing Catalog Identifiers
+
+This design deliberately does **not** introduce an abstract `Pipeline` 
resource or an `ExecutionTrigger` enum. Both would re-declare targeting that 
the storage model already owns and let the two drift. A trace pipeline is 
instead expressed entirely with identifiers that already exist:
+
+- **Group** — named by the config's own `metadata.group` 
(`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that 
Group, exactly as every other schema resource does. The Group already fixes the 
`catalog`, so catalog is never repeated.
+- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage 
from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and 
carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, 
`keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary 
queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules 
are configured here, per stage and per pipeline — there are no hardcoded 
predicates in the engine.
+- **Schema selector** — an explicit `schema_names` list (exact match on 
`common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches 
if it is listed OR matches the regex; both empty targets every schema in the 
Group.
+
+The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced 
by **stage lifecycle events**, because each trigger was already anchored to a 
stage: compaction is an LSM merge within a stage (§8.1), migration is a segment 
leaving one stage for the next (§8.2), and the scheduled pass finalizes a 
settled time segment within a stage (§8.3). Each `StageRule` carries its own 
`apply_at` selecting which of those events run the filter; empty means all of 
them. This keeps the trace-specific rules out of the generic, catalog-agnostic 
`common.v1.LifecycleStage` (which stream and measure groups share) while still 
binding them to stages by name.
+
+Multiple `TracePipelineConfig`s may coexist in a Group only when their 
effective coverage is disjoint; the schema registry enforces this with a 
per-tuple uniqueness rule (§2.3).
+
+### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`)
+
+The full message definitions live in the proto source at 
`api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package 
`banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is 
the single root resource: it carries its own identity (`metadata`), the 
targeting fields from §2.1, and the trace-specific tail-sampling and per-stage 
retention rules. There is no embedded abstract `Pipeline` and no 
`ExecutionTrigger`; the lifecycle events that run the filter are named by 
`apply_at`.
+
+The message set is:
+
+- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the 
per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, 
the `tail_sampling` (gating) block, and the completeness windows `merge_grace` 
(§8.1) and `finalize_grace` (§8.3).
+- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the 
per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, 
`min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are 
OR-combined: a trace survives if any one matches; it is dropped if it fails all 
set predicates. A rule with no predicates set at all has no filtering effect 
(see §4.2).
+- **`StageEvent`** — enum of the lifecycle events a rule may run at: 
`STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and 
`STAGE_EVENT_MIGRATION_OUT` (§8.2).
+- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: 
`duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` 
(each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`).
+- **`TagMatcher`** — sure-keep tag predicate used by 
`StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries 
no `sample_rate` — a match is an absolute keep.
+
+### 2.3 Uniqueness and Conflict Policy
+
+Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most 
one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` 
tuple**. The schema registry enforces this on write so the filter behavior at 
any tuple is deterministic — there is no implicit ordering, priority, or 
composition of overlapping pipelines.
+
+**Effective coverage of a pipeline.** A `TracePipelineConfig` P with 
`enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when:
+
+- `schema` is a schema in `P.metadata.group` whose name is listed in 
`P.schema_names` or matches `P.schema_name_regex` (both empty selectors target 
every schema in the Group);
+- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every 
stage in the Group's `ResourceOpts`);
+- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers 
every `StageEvent`).
+
+**Conflict detection on write.** When a `TracePipelineConfig` is created or 
updated, the registry expands its effective coverage against the current 
schemas and stages of its Group and rejects the write if **any** covered tuple 
is already covered by another `enabled` pipeline in the same Group. The error 
names the conflicting pipeline and the overlapping tuple(s).
+
+**Schema additions.** When a new `Trace` schema is created in a Group, the 
registry re-validates every `enabled` `TracePipelineConfig` in that Group 
against the new schema. If the new schema would cause two pipelines to overlap 
on any `(stage, event)`, the schema creation is rejected; the operator must 
narrow the pipelines (e.g. add explicit `schema_names`) before adding the 
schema.
+
+**Disabling and replacement.** Setting `enabled=false` removes a pipeline from 
active coverage immediately, freeing its tuples for another pipeline to claim. 
This is the safe path for atomically swapping in a new pipeline: disable the 
old, enable or create the new, then delete the old. There is no implicit 
precedence between two `enabled` pipelines — the registry will not pick a 
winner.
+
+**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and 
migration (§8.2) the engine selects the active pipeline by `(group, schema, 
stage, event)`. If a registry inconsistency ever exposes more than one active 
match for a tuple, the engine logs an error and applies **none** of them — 
failing safe to retain rather than risk a nondeterministic destructive drop.
+
+**Recommended pattern.** For most workloads, a single pipeline per Group with 
a broad `schema_name_regex` is the simplest configuration. Multiple pipelines 
in a Group are useful only when their schema selectors are mutually exclusive 
(for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in 
a hypothetical mixed group) so no tuple is doubly covered.
+
+### 2.4 Admission Validation
+
+The proto pins bounds with PGV so a malformed config is rejected at parse time 
rather than producing undefined behavior. The schema registry's admission 
control adds one cross-field rule that PGV cannot express.
+
+**PGV-enforced bounds** (rejected at proto parse / `Write` time):
+
+- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 
0s`).
+- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in 
`[0.0, 1.0]`.
+- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset 
means duration is not a retention factor for that stage.
+- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — 
strictly positive when set; unset means engine default (§8.1 / §8.3).
+
+**Admission rule (server-side; not expressible in PGV):**
+
+- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = 
true`, the config must declare at least one of: a non-empty `tail_sampling`, or 
at least one `StageRule` with at least one retention predicate (`min_duration`, 
`keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with 
no gating and no per-stage predicates has no effect on traces and is rejected 
as a configuration error. To temporarily disable a pipeline, set `enabled = 
false` — do not strip its rule blocks.
+
+Both the registry write path and the runtime apply these checks. If a registry 
inconsistency ever lets a malformed config reach the engine, the engine logs an 
error and falls back to **retain** for every tuple the malformed config would 
have governed — never a destructive drop (consistent with the §2.3 runtime 
fail-safe).
+
+## Retention-Decision Timing & Trace Completeness
+
+Both the tail-sampling gate (§1.1) and the per-stage retention predicates 
(§1.2) operate on the **whole** trace, not on individual spans as they arrive: 
`D_total` needs the earliest start and latest end, `has_error` scans all spans, 
and tag matchers scan all spans. Spans of one trace, however, arrive at the 
storage node anywhere from milliseconds to hours apart, and BanyanDB writes 
each span into the segment selected by its **event time** (the `start_time` 
tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no 
live partial-trace buffer; a trace exists only as spans co-located by 
`trace_id` inside a segment's parts.
+
+### 3.1 When Retention Decisions Are Safe
+
+Predicates are not evaluated for destructive drops at write time. They run 
lazily, by the post-trace passes that re-assemble a trace by `trace_id`:
+
+- **Provisional** — during an LSM compaction merge (§8.1). This sees only the 
spans that have landed so far, so a drop verdict could be premature; the 
per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace 
has stopped growing.
+- **Authoritative** — at the post-trace scheduler's finalization pass once the 
trace's segment has **settled**: its event-time window has closed *and* the 
event-time watermark has advanced past it by the `finalize_grace` period 
(§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates 
the next segment up to an hour *before* the current window ends, and late data 
keeps routing to an old segment by event time — so "settled" is a watermark 
heuristic, not a guarantee.
+- **At migration_out** — by the time a segment is migrated to the next stage 
(≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every 
trace it holds has been settled, so predicate evaluation is final.
+
+Two completeness caveats follow from event-time segmentation:
+
+1. **Spans arriving after finalization.** A span whose `start_time` falls in 
an already-finalized segment is still accepted (while the segment is within 
retention) but is missed by the finalization-time evaluation. The 
`finalize_grace` period bounds how often this happens; a late span that lands 
after it is only re-evaluated if a subsequent compaction rewrites that part.
+
+2. **Traces spanning multiple segments.** A trace longer than 
`segment_interval`, or one that straddles a boundary, is physically split 
across two segments, and each segment's pass evaluates only its own fragment 
(`D_total` and the indicators are fragment-local). BanyanDB performs no 
cross-segment trace assembly at the storage layer. Where trace duration is far 
below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day 
segments) the fragment equals the whole trace and this is a non-issue; for 
long-running traces it is a known limitation.
+
+## Integration of the Retention System & Time-Aging System
+
+Distributed telemetry's diagnostic utility naturally decays as time passes. 
The pipeline must therefore decide, per stage, what to keep — but it must 
**not** try to pick the physical storage medium, because in BanyanDB the medium 
is already determined by where a `LifecycleStage` is placed.
+
+### 4.1 Stage Placement vs. Rule-Governed Retention
+
+The physical medium is **not** fixed per tier by this design. Each 
`LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through 
its `node_selector`; the medium is whatever hardware that node group runs, and 
operators may deploy heterogeneous backings (for example two warm node groups 
on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA 
SSD/HDD, Cold → object storage, but that mapping is an operator deployment 
choice, not a rule this pipeline enforces.
+
+Because medium selection already belongs to stage placement, per-stage 
`StageRule` predicates **do not route traces to a medium**. Within whatever 
stage currently holds a trace, the predicates only act as a **retention gate** 
— dictating whether a trace is retained when a partition is rewritten or 
migrated, or discarded/pruned to reduce storage footprint.
+
+### 4.2 Stage-Stepped Retention via Rising Predicates
+
+"Aging" is expressed structurally: each `StageRule` declares its own 
predicates, and the bar rises as data moves to colder stages. **Semantics of a 
`StageRule` at one of its `apply_at` events:** the set predicates are 
OR-combined — a trace is retained if **any** set predicate matches it, and 
dropped if it fails all of them. A `StageRule` with no predicates configured at 
all has no filtering effect (every trace at that stage is retained); to 
actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, 
or `keep_tag_rules` must be set.
+
+A typical profile rises Hot → Warm → Cold by tightening predicates at each 
stage's `MIGRATION_OUT` event. For example:
+
+- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag 
(PostgreSQL, ActiveMQ, etc.).
+- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the 
most-important tag.
+- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces 
are dropped, only errors persist for the full 30-day retention.
+
+Predicates are direct boolean checks, so retention is deterministic and 
self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` 
matched, even though it was only 3 ms." Time enters only through *when* a 
partition reaches each stage, which is governed by `LifecycleStage.ttl` / 
`segment_interval`.
+
+## Downstream Lifecycle Actions Governed by Per-Stage Predicates
+
+The time-aging engine evaluates each trace against its current stage's 
`StageRule` predicates at the stage's configured lifecycle events, filtering 
data blocks as partitions are rewritten or migrated between lifecycle stages.
+
+### 6.1 Compaction Rewrites
+
+During normal LSM merge operations inside a stage, the data node compacts 
older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its 
`apply_at`, the merge filter (§8.1) evaluates each mature trace against the 
stage's predicates and omits non-matching traces from the consolidated output. 
Compaction is then both a space-reclamation pass and a per-stage retention 
pass; pipelines that target only migration boundaries leave the routine LSM 
compaction byte-for-byte lossless.
+
+### 6.2 Partition-Level Tier Migration
+
+Both showcase trace groups (`sw_trace`, `sw_zipkinTrace`) declare a Hot → Warm 
→ Cold lifecycle in `ResourceOpts.stages`: Hot holds the active window (1-day 
ttl), Warm runs on `node_selector type=warm` (7-day ttl), and Cold runs on 
`node_selector type=cold` with `close=true` (30-day ttl). When a Hot segment 
matures past its 1-day ttl, the migration worker copies it to the Warm node 
group; the medium behind each node group (e.g. NVMe → SATA SSD → object 
storage) is the operator's deployment choice (§4.1), not something this 
pipeline picks.
+
+- The migration engine evaluates the source stage's `MIGRATION_OUT` predicates 
against every trace in the partition.
+
+- **No dynamic splitting is performed:** all retained data is written to the 
next stage's node group. Traces that fail the source stage's `MIGRATION_OUT` 
predicates are omitted from the target write stream, reducing the physical size 
of the migrated partition. With Scenario 7.1's Hot predicates a healthy 
`/homepage` trace (2802 ms) matches `min_duration: 100ms` and migrates to Warm; 
a PostgreSQL-touching trace matches `keep_tag_rules` and migrates too; a 
healthy fast trace (6 ms) matches none and is dropped.
+
+- When the partition matures past the Warm 7-day ttl, the Warm stage's 
`MIGRATION_OUT` predicates apply (typically stricter — e.g. `min_duration: 
500ms`); only matching traces are written into the Cold parts. Healthy baseline 
traces, even ones that survived Warm, are dropped here; error traces (with 
`keep_errors: true`) persist all the way to Cold.
+
+### 6.3 Eviction: Per-Trace Drop During Rewrites, Segment-Granularity GC
+
+BanyanDB does not tombstone individual traces, and this design does not add 
per-trace tombstones. The trace part layout is column-oriented (separate 
`primary`, `spans`, and `tags` streams), so there is no per-trace delete 
primitive. Eviction therefore happens at two distinct granularities:
+
+- **Per-trace removal is a side effect of the filter rewrites, not GC.** A 
trace that fails its current stage's retention predicates is simply omitted the 
next time its part is rewritten — during a compaction merge (§8.1), a 
segment-finalization pass (§8.3), or a pre-migration rewrite (§8.2). No 
tombstone is written; the trace's columns are not copied into the new part, and 
the space is reclaimed when the old part is retired.
+
+- **Whole-segment eviction stays the existing mechanism.** Reclaiming an 
entire time segment remains governed by the current retention path — TTL expiry 
plus the disk high/low watermarks (`banyand/trace/svc_standalone.go`, 
`TopicDeleteExpiredTraceSegments`). Per-stage predicates do not place 
tombstones and do not trigger segment deletion; they only change how much of a 
segment survives each rewrite.
+
+## Operational Scenario Configurations
+
+Below are two operational scenarios represented as complete 
`TracePipelineConfig` instances, one per trace group found in the 
skywalking-showcase cluster (`sw_trace` and `sw_zipkinTrace`). Group names, 
schema names, tags, latencies, and lifecycle stages are taken from real data in 
that cluster; the retention outcomes quoted are derived by evaluating the 
per-stage predicates against real sampled traces.
+
+### 7.1 Scenario 1: SkyWalking-Native Segment Retention (`sw_trace`)
+
+- **Objective**: On the showcase `sw_trace` group (schema `segment`), keep 
every error trace all the way to Cold for incident forensics, keep genuinely 
slow requests through Warm, always keep traces that touch PostgreSQL or the 
ActiveMQ `queue-songs-ping` queue at the Hot→Warm boundary, and 
probabilistically sample the healthy remainder at gating. The group's real Hot 
→ Warm → Cold stages (ttl 1d / 7d / 30d) carry rising predicate strictness at 
each `MIGRATION_OUT` event.
+
+- **Configuration JSON**:
+
+```Plain Text
+{
+  "metadata": { "group": "sw_trace", "name": "segment-tail-sampler" },
+  "enabled": true,
+  "stages": [
+    {
+      "stage": "hot",
+      "apply_at": ["STAGE_EVENT_MIGRATION_OUT"],
+      "min_duration": "0.100s",
+      "keep_errors": true,
+      "keep_tag_rules": [
+        { "tag_key": "db.type",  "equals": "PostgreSQL" },
+        { "tag_key": "mq.queue", "equals": "queue-songs-ping" }
+      ]
+    },
+    {
+      "stage": "warm",
+      "apply_at": ["STAGE_EVENT_MIGRATION_OUT"],
+      "min_duration": "0.500s",
+      "keep_errors": true,
+      "keep_tag_rules": [
+        { "tag_key": "db.type", "equals": "PostgreSQL" }
+      ]
+    },
+    {
+      "stage": "cold",
+      "apply_at": ["STAGE_EVENT_COMPACTION"],
+      "keep_errors": true
+    }
+  ],
+  "schema_names": ["segment"],
+  "tail_sampling": {
+    "duration_threshold": "0.500s",
+    "keep_all_errors": true,
+    "healthy_sample_rate": 0.1,
+    "tag_rules": [
+      { "tag_key": "db.type",  "equals": "PostgreSQL",       "sample_rate": 
1.0 },
+      { "tag_key": "mq.queue", "equals": "queue-songs-ping", "sample_rate": 
1.0 }
+    ]
+  },
+  "merge_grace": "30s",
+  "finalize_grace": "300s"
+}
+```
+
+- **Retention Dynamics** (real `sw_trace` traces, predicates evaluated at each 
stage's lifecycle event):
+
+    - The error trace `5fcdb353-…` (`POST /test`, `agent::app`, `is_error=1`, 
4 ms) is a sure-keep at gating (`keep_all_errors`). At Hot→Warm migration, 
`keep_errors` matches → kept. At Warm→Cold migration, `keep_errors` matches → 
kept. At Cold compaction, `keep_errors` matches → retained for the full 30-day 
Cold TTL.
+
+    - The slow healthy trace `b03bb932-…` (`/homepage`, `agent::ui` → 
`agent::frontend`, 2802 ms) is a sure-keep at gating (2802 ms > 500 ms 
`duration_threshold`). At Hot→Warm: 2802 ms ≥ 100 ms `min_duration` → kept. At 
Warm→Cold: 2802 ms ≥ 500 ms → kept. At Cold compaction: no error, `keep_errors` 
fails → **dropped from Cold**.
+
+    - A PostgreSQL-touching trace (e.g. `b31e4be8-…`, `agent::songs` 
`UndertowDispatch`, 3 ms, `db.type=PostgreSQL`) is sure-kept by the `db.type` 
tag rule at gating. At Hot→Warm: `keep_tag_rules` matches → kept. At Warm→Cold: 
`keep_tag_rules` (still includes `db.type`) matches → kept. At Cold compaction: 
no error → **dropped from Cold**.
+
+    - A healthy fast trace such as `GET:/songs` at 6 ms (`agent::songs`, 
`http.status_code=200`) is only kept at gating if it wins the 
`healthy_sample_rate` (`0.1`) hash. If kept, at Hot→Warm: 6 ms < 100 ms, no 
error, no tag match → **dropped at Hot→Warm migration**.
+
+### 7.2 Scenario 2: Istio / Zipkin Mesh Edge Sampling (`sw_zipkinTrace`)
+
+- **Objective**: On the showcase `sw_zipkinTrace` group (schema 
`zipkin_span`), apply a lower-cost edge sampler to the Istio service-mesh 
spans. The Zipkin schema has no first-class `is_error` column, so server errors 
are caught with a tag rule on the flattened `query` attributes rather than 
`keep_errors`; mesh gateway spans are kept by tag. Targets the group's Warm and 
Cold stages.
+
+- **Configuration JSON**:
+
+```Plain Text
+{
+  "metadata": { "group": "sw_zipkinTrace", "name": "zipkin-edge-sampler" },
+  "enabled": true,
+  "stages": [
+    {
+      "stage": "warm",
+      "apply_at": ["STAGE_EVENT_MIGRATION_OUT"],
+      "min_duration": "1s",
+      "keep_tag_rules": [
+        { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d" },
+        { "tag_key": "local_endpoint_service_name", "equals": 
"gateway.sample-services" }
+      ]
+    },
+    {
+      "stage": "cold",
+      "apply_at": ["STAGE_EVENT_COMPACTION"],
+      "keep_tag_rules": [
+        { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d" }
+      ]
+    }
+  ],
+  "schema_names": ["zipkin_span"],
+  "tail_sampling": {
+    "duration_threshold": "1.000s",
+    "keep_all_errors": false,
+    "healthy_sample_rate": 0.05,
+    "tag_rules": [
+      { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d", 
"sample_rate": 1.0 }
+    ]
+  },
+  "merge_grace": "30s",
+  "finalize_grace": "300s"
+}
+```
+
+- **Retention Dynamics** (real `sw_zipkinTrace` spans):
+
+    - The slowest mesh call observed — `trace_id 0961e077…`, a 30.7 s 
`istio.skywalking-showcase` client span to Grafana's live-WS endpoint 
(`http.status_code=101`) — is a sure-keep at gating (30.7 s > 1 s). At 
Warm→Cold migration: 30.7 s ≥ 1 s `min_duration` → kept. At Cold compaction: 
`query` carries `http.status_code=101` (not 5xx), no other predicate → 
**dropped from Cold**.
+
+    - A gateway span on `gateway.sample-services` at the mesh p90 (~19 ms): 
would be passed by gating only via the `0.05` `healthy_sample_rate` hash (19 ms 
< 1 s, no 5xx). If kept, at Warm→Cold: 19 ms < 1 s, but 
`local_endpoint_service_name = gateway.sample-services` matches → kept. At Cold 
compaction: no 5xx → **dropped from Cold**.
+
+    - A typical p50 mesh span (~2 ms) is kept at gating only via the `0.05` 
sample. At Warm→Cold: 2 ms < 1 s, not a gateway span, no 5xx → **dropped at 
Warm→Cold**.
+
+    - Any span carrying a `5xx` in its `query` attributes is sure-kept at 
gating, kept at Warm→Cold via the `query` tag matcher, and kept at Cold 
compaction via the same matcher — preserved for the full Cold TTL. Whole 
Warm/Cold segments are still reclaimed on their own schedule by 
`LifecycleStage.ttl`.
+
+## Post-Trace Data Flow Architecture
+
+The execution of the post-trace pipeline occurs natively on BanyanDB Data 
Nodes via three distinct pathways, matching storage lifecycle transitions.
+
+```Plain Text
+[ Raw Ingestion ] -> (Liaison Node) -> [ Fast Local Write ] -> (Data Node 
Memory Buffer)
+                                                                       |
++----------------------------------------------------------------------+
+|
+v
+[ Storage Node Post-Trace Processing Loops ]
+|
++---> 8.1. LSM Compaction Merge Filter Hook (in-line filter during merge; 
lossless unless targeted)
+|
++---> 8.2. Pre-Migration Filter Rewrite (rewrite segment to a reduced part, 
then migrate it unchanged)
+|
++---> 8.3. Scheduled Final Filter on a Settled Segment (watermark past window 
+ grace; best-effort final)
+```
+
+### 8.1 LSM Compaction Merge Phase Loop (In-line Merge Filter Hook)
+
+During LSM file compaction, immature part files containing segmented trace 
blocks are merged into unified, consolidated parts. The merge loop already 
streams blocks grouped by `trace_id`, so it is the ideal place to evaluate 
traces without a separate read pass.
+
+> **Design decision (accepted):** The trace LSM merger is refactored to expose 
a **trace-filter hook**. The pipeline plugs a filter into the existing merge 
stream so that, when a `TracePipelineConfig` with a `StageRule` whose 
`apply_at` includes `STAGE_EVENT_COMPACTION` (or is empty) targets the schema 
being compacted, blocks belonging to dropped traces are omitted from the 
consolidated output. Without a targeting pipeline the hook is a no-op and the 
merge stays byte-for-byte lossless, preserving the current LSM correctness 
guarantee.
+
+#### Integration point in the real merger
+
+The hook is injected into `mergeBlocks` (`banyand/trace/merger.go`), which is 
the single point where every emitted block flows to the block writer. Both 
write paths are wrapped by the filter:
+
+- `blockWriter.mustWriteRawBlock` — the fast path that copies a single-block 
trace as raw bytes without decoding.
+- `blockWriter.mustWriteBlock` — the slow path that emits a decoded, 
accumulated block for a `trace_id`.
+
+```Plain Text
+[ Immature Parts ] --> [ blockReader.nextBlockMetadata ] (streams blocks 
ordered by trace_id)
+          |
+          v
+[ Per-trace assembly (existing pending-block accumulation in mergeBlocks) ]
+          |
+          v
+[ Mature?  traceMaxTs (timestampsMetadata.max) < now - merge_grace ]
+     |                                        |
+     | NO (trace may still grow)              | YES (stopped growing)
+     v                                        v
+[ RETAIN unchanged ]                    [ TraceFilter hook ] --- keyed on 
trace_id, cached
+                                             |   - evaluate the stage's 
StageRule predicates
+                                             |     (min_duration / keep_errors 
/ keep_tag_rules)
+                                             |
+                                             +--> RETAIN --> mustWriteBlock / 
mustWriteRawBlock
+                                             |
+                                             +--> DROP --> blocks skipped; 
spans, tags, primary
+                                                           entries never 
written (space reclaimed
+                                                           in-stream, no 
delete mutation)
+```
+
+#### Filter contract and what the merge already gives us for free
+
+1. **Duration is free; errors and tags cost a decode.** Block metadata carries 
`timestampsMetadata{min,max}` (`block_metadata.go`), so `D_total` is derivable 
on the raw fast path without unmarshaling — `min_duration` checks are free. 
`keep_errors` and `keep_tag_rules` predicates inspect span bodies, so any 
pipeline using them forces the decoded slow path (`loadBlockData`) for the 
targeted schema. Duration-only stage rules can keep the raw fast path.
+
+2. **Decisions are per `trace_id`, not per block.** A single trace may be 
emitted as multiple blocks when its accumulated span size crosses 
`maxUncompressedSpanSize`. The filter computes one retain/drop verdict per 
`trace_id` and applies it to **every** block carrying that id, so a trace is 
never partially written.
+
+3. **A trace is dropped only after it stops growing (per-trace grace).** 
Dropping blocks makes a compaction intentionally lossy, and compaction is 
triggered by part accumulation (`getPartsToMerge`), not by time — so a merge 
routinely runs on the active write window while a trace's remaining spans are 
still seconds away. Dropping such a trace on its partial spans would orphan the 
late ones. The filter therefore evaluates a trace only once its **latest span 
timestamp is older than `now − merge_grace`** (the trace's max timestamp is 
read for free from `timestampsMetadata.max`, point 1); traces newer than that 
frontier are passed through the merge unchanged. The filter is additionally 
gated on targeting — consulted only when the snapshot's group/schema matches a 
pipeline with a `StageRule` whose `apply_at` includes `STAGE_EVENT_COMPACTION`; 
otherwise the default compaction path is unchanged. This `merge_grace` is a 
**per-trace** maturity window, distinct from the **per-segment** `
 finalize_grace` of §8.3: the former bounds intra-trace span spread, the latter 
bounds segment-wide late arrival. Both are fields of `TracePipelineConfig` 
(`merge_grace`, `finalize_grace`); the engine applies documented defaults 
(`30s` and `5m` respectively) when a field is unset.
+
+4. **Derived part state must reflect only retained traces.** When a trace is 
dropped, its entries are excluded from `partMetadata` counts, the 
`traceIDFilter` bloom filter (`mustWriteTraceIDFilter`), and the `tagType` set 
written at `Flush`. The filter runs before these are finalized so the 
consolidated part stays self-consistent.
+
+5. **Secondary indexes are pruned in lockstep.** 
`mergePartsThenSendIntroduction` merges sidx parts separately from the trace 
blocks (`banyand/trace/merger.go`). The set of dropped `trace_id`s is 
propagated to the sidx merge so dropped traces are not left dangling in 
secondary indexes.
+
+6. **Drops are final and the merge is crash-safe.** Because a trace is dropped 
only after `merge_grace` has elapsed since its last span (point 3), the verdict 
acts on a trace that has stopped growing, so the drop is final rather than 
premature. The merge writes the new part atomically and only retires source 
parts after the introduction is applied, so a crash mid-merge leaves the 
immature parts intact for retry; re-running compaction on an already-filtered 
part is a no-op.
+
+### 8.2 Hot-to-Warm Tier Migration (Pre-Migration Filter Rewrite)
+
+When the lifecycle agent migrates a segment to the next stage, the existing 
transfer is an opaque byte stream of part directories — it is not a per-trace 
filter. Today's path walks shards via `traceMigrationVisitor.VisitShard` 
(`banyand/backup/lifecycle/trace_migration_visitor.go`), reads source parts 
with `generateAllPartData` (which opens part files via 
`trace.CreatePartFileReaderFromPath`, `banyand/trace/part.go`), and ships them 
with `streamPartToTargetShard` over a chunked sync client to the destination 
replicas.
+
+> **Design decision (accepted):** Do **not** filter inside the byte-copy 
transfer. Instead, run a **pre-migration filter pass** on the source segment 
that reads the existing parts through the pipeline and writes a **new, reduced 
part**; the unchanged migration then streams that new part to the next stage. 
This keeps "migration" a faithful byte copy and isolates all lossy behaviour in 
a dedicated rewrite step.
+
+```Plain Text
+[ Hot Tier segment (settled; >= stage TTL old) ]
+        |
+        v
+[ Pre-migration rewrite pass ]  (new visitor over the segment's parts)
+        |  read each part: blockReader over CreatePartFileReaderFromPath
+        |  apply pipeline filter (same trace-filter contract as §8.1)
+        |  write retained traces to a NEW reduced part via blockWriter
+        v
+[ Reduced part on Hot tier ]
+        |
+        v
+[ Existing migration (unchanged) ]  VisitShard -> generateAllPartData -> 
streamPartToTargetShard
+        |  opaque chunked byte copy of the reduced part
+        v
+[ Warm tier segment ]
+```
+
+1. The pre-migration pass attaches between `generateAllPartData` and 
`streamPartToTargetShard` in `VisitShard`: each source part is read with the 
same `blockReader`/`blockWriter` machinery the merger uses (§8.1), the pipeline 
filter is applied, and a reduced part is produced in place of the source.
+
+2. The filter is the same `STAGE_EVENT_MIGRATION_OUT`-gated trace-filter 
contract as §8.1 — per-`trace_id` retain/drop verdicts, with dropped traces 
excluded from `partMetadata`, the `traceIDFilter` bloom filter, `tagType`, and 
the parallel sidx parts. Because a segment only migrates once it is at least 
its stage TTL old (e.g. a Hot segment after ~1 day), its event-time window 
closed long ago and any late arrivals have settled, so every trace is 
effectively complete and the verdict is final.
+
+3. The reduced part — not the original — is what `streamPartToTargetShard` 
transfers. The migration protocol, replica fan-out, and destination 
registration are untouched, so the considerable savings come purely from 
shipping fewer bytes to the next tier.
+
+### 8.3 Scheduled Final Filter on a Settled Segment
+
+Filtering during an LSM merge (§8.1) sees only the spans present in the parts 
being merged; a trace whose remaining spans are still arriving could be judged 
prematurely. The **final** tail-sampling decision must therefore run only once 
a segment is unlikely to gain more spans. The catch: **BanyanDB has no "segment 
sealed" signal** to trigger this, so the pass is driven by an event-time 
watermark and a grace period instead.
+
+Why there is no seal event (grounding in 
`banyand/internal/storage/rotation.go`):
+
+- **Writes route by event time, so old segments keep receiving data.** A span 
is written into the segment whose window contains its `start_time` 
(`banyand/trace/write_standalone.go` → `CreateSegmentIfNotExist(time.Unix(0, 
ts))`), not its arrival time. A span that arrives hours late still lands in its 
old event-time segment for as long as that segment exists.
+- **Segment creation is look-ahead, not rollover.** The rotation loop 
pre-creates the *next* segment up to `creationGap` (1 hour) **before** the 
current window ends: it fires `segmentController.create` only while `0 < 
latest.End - eventTime <= newSegmentTimeGap`. At that moment the current 
segment is still the active write target, so `create()` cannot mean "the 
previous segment is sealed."
+- **`closeIdleSegments` is not a seal.** It only releases idle in-memory 
handles after an idle timeout (`banyand/internal/storage/segment.go`) and 
reopens them on the next access; it says nothing about window completeness.
+- **The only definitive boundary is TTL removal** (`retentionTask`, cron `5 
0`), which deletes the whole segment — far too late to be a finalization 
trigger.
+
+> **Design decision (accepted):** Drive the final filter from an **event-time 
watermark plus a settling grace period**, not from a (non-existent) seal event. 
The rotation path already tracks the maximum observed event time 
(`latestTickTime`, fed by `Tick`, `rotation.go`). A segment is treated as 
**settled** once `watermark > segment.End + finalize_grace`, where 
`finalize_grace` is `TracePipelineConfig.finalize_grace` — a configured 
expected-late-arrival lag (engine default `5m` if unset), distinct from the 
per-trace `merge_grace` of §8.1. A new post-trace scheduler — registered on 
`pkg/timestamp.Scheduler` (cron-backed, like `retentionTask`) — periodically 
scans for settled-but-unfinalized segments and runs the final filter once per 
segment. This is the proto's `STAGE_EVENT_FINALIZE`. It is **best-effort 
final**, not a hard guarantee: data arriving after `finalize_grace` is missed 
(§3.1, caveat 1). The finalization pass is also the point where the 
tail-sampling gate (`tail_s
 ampling`) is applied — traces failing both the gate and any 
`STAGE_EVENT_FINALIZE` `StageRule` predicates are dropped here.

Review Comment:
   The finalization description says traces are dropped only if they fail both 
the tail-sampling gate and any `STAGE_EVENT_FINALIZE` `StageRule` predicates, 
which implies StageRule can override a tail-sampling drop. Earlier sections 
describe tail-sampling as deciding whether a trace is retained at all; please 
clarify the exact composition/order at FINALIZE (e.g., is tail_sampling an 
absolute gate, or is retention the union of tail_sampling and StageRule 
sure-keeps?).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: post-trace pipeline design — tail-sampling + per-stage rules [skywalking-banyandb]

Reply via email to