Copilot commented on code in PR #1144: URL: https://github.com/apache/skywalking-banyandb/pull/1144#discussion_r3316216950
########## api/proto/banyandb/pipeline/v1/trace_pipeline.proto: ########## @@ -0,0 +1,187 @@ +// Licensed to Apache Software Foundation (ASF) under one or more contributor +// license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright +// ownership. Apache Software Foundation (ASF) licenses this file to you under +// the Apache License, Version 2.0 (the "License"); you may +// not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto3"; + +package banyandb.pipeline.v1; + +import "banyandb/common/v1/common.proto"; +import "google/protobuf/duration.proto"; +import "validate/validate.proto"; + +option go_package = "github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1"; +option java_package = "org.apache.skywalking.banyandb.pipeline.v1"; + +// StageEvent enumerates the lifecycle events of a stage at which the filter runs. +enum StageEvent { + STAGE_EVENT_UNSPECIFIED = 0; + // In-line filter during an LSM merge within a stage. + STAGE_EVENT_COMPACTION = 1; + // Scheduled final filter once a segment has settled: its event-time window + // has closed and the watermark has advanced past it by a grace period. + // (BanyanDB has no hard "seal" event; settling is watermark-driven.) + STAGE_EVENT_FINALIZE = 2; + // Pre-migration rewrite when a segment leaves this stage for the next. + STAGE_EVENT_MIGRATION_OUT = 3; +} + +// TracePipelineConfig is the root configuration for a storage-node trace pipeline. +// It reuses existing catalog identifiers (group via metadata, stage names, schema +// names) for targeting instead of declaring a parallel metadata model. +message TracePipelineConfig { + // Identity and revision tracking; metadata.group is the Group this pipeline + // lives in and applies to, consistent with every other schema resource. + common.v1.Metadata metadata = 1; + // Active status of the pipeline. + bool enabled = 2; + // Per-stage retention rules: which lifecycle stages this pipeline acts on, + // with the keep-predicates and lifecycle events for each. Empty means the + // tail-sampling gate at finalization is the only filter; no per-stage drop + // is applied. + repeated StageRule stages = 3; + // Explicit schema names to target within the Group (exact match on Metadata.name). + repeated string schema_names = 4; + // RE2 regular expression matched against schema names. A schema is targeted if it + // is listed in schema_names OR matches this pattern. When both are empty, every + // schema in the Group is targeted. + string schema_name_regex = 5; + // Tail-sampling (gating) rules applied at finalization (§8.3) to decide + // whether a trace is retained at all once its segment has settled. + TailSampling tail_sampling = 6; + // Per-trace maturity window for the in-line merge filter (§8.1). A trace is + // eligible for dropping during an LSM compaction merge only once its latest + // span timestamp is older than `now - merge_grace`; younger traces pass + // through the merge unchanged. Bounds the expected intra-trace span arrival + // spread (typically seconds). Must be strictly positive if set; if unset, the + // engine uses a documented default (currently 30s). + google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = { + gt: {seconds: 0} + }]; + // Per-segment settling window for the scheduled finalization pass (§8.3). A + // segment is treated as settled, and the authoritative final filter runs, + // once the event-time watermark exceeds `segment.End + finalize_grace`. + // Bounds segment-wide late arrival (typically minutes). Must be strictly + // positive if set; if unset, the engine uses a documented default + // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace, + // `finalize_grace` is per-segment. + google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = { + gt: {seconds: 0} + }]; +} + +// StageRule binds the pipeline to one lifecycle stage of the targeted Group +// and declares per-stage retention predicates. +// +// Semantics (when the rule fires at one of its `apply_at` events): +// - If every predicate field is unset (no `min_duration`, no `keep_errors`, +// and an empty `keep_tag_rules`) the rule has no filtering effect — every +// trace at this stage is retained, and the tail-sampling gate at +// finalization remains the only filter affecting traces here. +// - Otherwise the *set* predicates are combined with OR: a trace is retained +// if any single set predicate matches it; a trace that fails all set +// predicates is dropped from this stage at the firing event. +// +// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise +// OR-combined (any matcher match satisfies the keep_tag_rules predicate). +message StageRule { + // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", "cold"). + string stage = 1; + // Which of this stage's lifecycle events run the filter. Empty means all of Review Comment: Several string identifier fields are currently unconstrained, allowing empty stage names or tag keys. Elsewhere in the repo, identifier strings are validated with `string.min_len = 1` (e.g., `common.v1.LifecycleStage.name`). Adding the same validations here prevents configs that can't be meaningfully applied. ########## api/proto/banyandb/pipeline/v1/trace_pipeline.proto: ########## @@ -0,0 +1,187 @@ +// Licensed to Apache Software Foundation (ASF) under one or more contributor +// license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright +// ownership. Apache Software Foundation (ASF) licenses this file to you under +// the Apache License, Version 2.0 (the "License"); you may +// not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto3"; + +package banyandb.pipeline.v1; + +import "banyandb/common/v1/common.proto"; +import "google/protobuf/duration.proto"; +import "validate/validate.proto"; + +option go_package = "github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1"; +option java_package = "org.apache.skywalking.banyandb.pipeline.v1"; + +// StageEvent enumerates the lifecycle events of a stage at which the filter runs. +enum StageEvent { + STAGE_EVENT_UNSPECIFIED = 0; + // In-line filter during an LSM merge within a stage. + STAGE_EVENT_COMPACTION = 1; + // Scheduled final filter once a segment has settled: its event-time window + // has closed and the watermark has advanced past it by a grace period. + // (BanyanDB has no hard "seal" event; settling is watermark-driven.) + STAGE_EVENT_FINALIZE = 2; + // Pre-migration rewrite when a segment leaves this stage for the next. + STAGE_EVENT_MIGRATION_OUT = 3; +} + +// TracePipelineConfig is the root configuration for a storage-node trace pipeline. +// It reuses existing catalog identifiers (group via metadata, stage names, schema +// names) for targeting instead of declaring a parallel metadata model. +message TracePipelineConfig { + // Identity and revision tracking; metadata.group is the Group this pipeline + // lives in and applies to, consistent with every other schema resource. + common.v1.Metadata metadata = 1; Review Comment: `TracePipelineConfig.metadata` is the identity field for this resource, but it isn't marked as required like other BanyanDB schema resources. Without `message.required`, an empty config can be accepted by PGV even though the server will need `metadata` for name/group/revision handling. ########## api/proto/banyandb/pipeline/v1/trace_pipeline.proto: ########## @@ -0,0 +1,187 @@ +// Licensed to Apache Software Foundation (ASF) under one or more contributor +// license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright +// ownership. Apache Software Foundation (ASF) licenses this file to you under +// the Apache License, Version 2.0 (the "License"); you may +// not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto3"; + +package banyandb.pipeline.v1; + +import "banyandb/common/v1/common.proto"; +import "google/protobuf/duration.proto"; +import "validate/validate.proto"; + +option go_package = "github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1"; +option java_package = "org.apache.skywalking.banyandb.pipeline.v1"; + +// StageEvent enumerates the lifecycle events of a stage at which the filter runs. +enum StageEvent { + STAGE_EVENT_UNSPECIFIED = 0; + // In-line filter during an LSM merge within a stage. + STAGE_EVENT_COMPACTION = 1; + // Scheduled final filter once a segment has settled: its event-time window + // has closed and the watermark has advanced past it by a grace period. + // (BanyanDB has no hard "seal" event; settling is watermark-driven.) + STAGE_EVENT_FINALIZE = 2; + // Pre-migration rewrite when a segment leaves this stage for the next. + STAGE_EVENT_MIGRATION_OUT = 3; +} + +// TracePipelineConfig is the root configuration for a storage-node trace pipeline. +// It reuses existing catalog identifiers (group via metadata, stage names, schema +// names) for targeting instead of declaring a parallel metadata model. +message TracePipelineConfig { + // Identity and revision tracking; metadata.group is the Group this pipeline + // lives in and applies to, consistent with every other schema resource. + common.v1.Metadata metadata = 1; + // Active status of the pipeline. + bool enabled = 2; + // Per-stage retention rules: which lifecycle stages this pipeline acts on, + // with the keep-predicates and lifecycle events for each. Empty means the + // tail-sampling gate at finalization is the only filter; no per-stage drop + // is applied. + repeated StageRule stages = 3; + // Explicit schema names to target within the Group (exact match on Metadata.name). + repeated string schema_names = 4; + // RE2 regular expression matched against schema names. A schema is targeted if it + // is listed in schema_names OR matches this pattern. When both are empty, every + // schema in the Group is targeted. + string schema_name_regex = 5; + // Tail-sampling (gating) rules applied at finalization (§8.3) to decide + // whether a trace is retained at all once its segment has settled. + TailSampling tail_sampling = 6; + // Per-trace maturity window for the in-line merge filter (§8.1). A trace is + // eligible for dropping during an LSM compaction merge only once its latest + // span timestamp is older than `now - merge_grace`; younger traces pass + // through the merge unchanged. Bounds the expected intra-trace span arrival + // spread (typically seconds). Must be strictly positive if set; if unset, the + // engine uses a documented default (currently 30s). + google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = { + gt: {seconds: 0} + }]; + // Per-segment settling window for the scheduled finalization pass (§8.3). A + // segment is treated as settled, and the authoritative final filter runs, + // once the event-time watermark exceeds `segment.End + finalize_grace`. + // Bounds segment-wide late arrival (typically minutes). Must be strictly + // positive if set; if unset, the engine uses a documented default + // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace, + // `finalize_grace` is per-segment. + google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = { + gt: {seconds: 0} + }]; +} + +// StageRule binds the pipeline to one lifecycle stage of the targeted Group +// and declares per-stage retention predicates. +// +// Semantics (when the rule fires at one of its `apply_at` events): +// - If every predicate field is unset (no `min_duration`, no `keep_errors`, +// and an empty `keep_tag_rules`) the rule has no filtering effect — every +// trace at this stage is retained, and the tail-sampling gate at +// finalization remains the only filter affecting traces here. +// - Otherwise the *set* predicates are combined with OR: a trace is retained +// if any single set predicate matches it; a trace that fails all set +// predicates is dropped from this stage at the firing event. +// +// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise +// OR-combined (any matcher match satisfies the keep_tag_rules predicate). +message StageRule { + // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", "cold"). + string stage = 1; + // Which of this stage's lifecycle events run the filter. Empty means all of + // them (compaction merges, segment finalization, and outbound migration). + repeated StageEvent apply_at = 2; + // Minimum total trace duration to retain at this stage. A trace whose + // duration is at least min_duration is retained. Strictly positive if set; + // if unset, duration is not a retention factor for this stage. + google.protobuf.Duration min_duration = 3 [(validate.rules).duration = { + gt: {seconds: 0} + }]; + // If true, traces with any error span are retained at this stage regardless + // of duration or tag predicates. + bool keep_errors = 4; + // Sure-keep tag matchers: a trace satisfies this predicate when ANY of the + // matchers matches any of its spans (matchers are OR-combined). Distinct + // from TailSampling.tag_rules (which is for ingest-time gating with a + // sample_rate); TagMatcher has no sample_rate — a match is an absolute keep. + repeated TagMatcher keep_tag_rules = 5; +} + +// TailSampling manages tail-sampling (ingest-gating) decisions at finalization. +message TailSampling { + // Absolute trace duration threshold. Traces whose total latency meets or + // exceeds this value are preserved (a sure-keep at gating). Required and + // strictly positive. + google.protobuf.Duration duration_threshold = 1 [(validate.rules).duration = { + required: true + gt: {seconds: 0} + }]; + // Immediate retention rule when any span contains an error state. + bool keep_all_errors = 2; + // Probabilistic keep rate for healthy, low-latency traces that match no + // sure-keep rule. The decision is a deterministic hash(trace_id) < rate, so + // it is stable across re-evaluation. Must be in [0.0, 1.0]. + double healthy_sample_rate = 3 [(validate.rules).double = { + gte: 0.0 + lte: 1.0 + }]; + // Conditions based on tags for immediate sampling decisions. + repeated TagSamplingRule tag_rules = 4; +} + +// TagSamplingRule keeps traces whose tag value satisfies a matcher, with an +// optional probabilistic sample rate (used by the tail-sampling gate). +message TagSamplingRule { + string tag_key = 1; + // Matching condition against the tag's value. + oneof matcher { + // Exact string equality. + string equals = 2; + // Value is one of a set. + StringList in = 3; + // RE2 regular expression match. + string regex = 4; + // Tag is present with any value. + bool exists = 5; + } + // Probabilistic keep rate applied when the matcher matches (deterministic + // hash(trace_id) < sample_rate); set to 1.0 to make a match an absolute sure-keep. + double sample_rate = 6 [(validate.rules).double = { + gte: 0.0 + lte: 1.0 + }]; +} + +// TagMatcher is a sure-keep tag predicate used by StageRule.keep_tag_rules. +// Unlike TagSamplingRule, it carries no sample_rate — a match is an absolute keep. +message TagMatcher { + string tag_key = 1; + // Matching condition against the tag's value. Review Comment: `TagMatcher.tag_key` should be non-empty for the same reason as `TagSamplingRule.tag_key`; an empty key yields a config that can't match any tag. ########## docs/design/backlog/post-trace-scoring.md: ########## @@ -0,0 +1,113 @@ +# Backlog: Scoring-Based Post-Trace Retention + +> **Status: Deferred.** Removed from the main post-trace pipeline design ([`../post-trace-pipeline.md`](../post-trace-pipeline.md)) in favor of per-stage hard retention predicates (`StageRule.min_duration`, `keep_errors`, `keep_tag_rules`). This document preserves the previously-proposed weighted scoring model — formula, persistence, revision pinning, and proto messages — for potential future revival if a workload demonstrates a genuine need for weighted multi-signal retention. + +## Why deferred + +Scoring's appeal is collapsing multiple value signals (duration, errors, tag matches) into one weighted number per trace, with each stage setting a single retention threshold. Its costs, weighed against the simpler per-stage hard-predicate model, were: + +- **Opacity.** "Why was this trace dropped?" requires evaluating the formula against pinned weights and thresholds. +- **Translation pain for simple cases.** Expressing "Hot keeps duration > 50 ms" as a score threshold requires `log10(50/D_threshold + 1)` — a terrible UX for what should be a direct cutoff. +- **Magic-number tuning.** Operators must calibrate weights such that the threshold lands meaningfully across all trace categories. +- **Debug friction.** Predicate-based retention is self-explanatory; scoring requires simulation. + +For the workloads currently targeted (SkyWalking native and Istio/Zipkin mesh traces, per the showcase), retention intent is well captured by per-stage hard predicates. Scoring's only genuine win — relative weighting of categories ("error_weight is 4× duration_weight") — is rarely needed in practice. If a workload arrives where it *is* needed, this document is the starting point for revival. + +## Architectural role (former §1.2) + +In the scoring model, gating (tail-sampling) decides whether a trace block survives at all, and scoring then governs **where it lives and how long it is kept** within whatever stage holds it: + +- **Operation Type**: Continuous utility function (scales from $0.0$ to $\infty$). +- **Responsibility**: Multi-tier storage routing, TTL scheduling, downstream analytical profiling. +- **Compute Profile**: Moderate CPU overhead — weighted multi-attribute logarithmic curves and dynamic tag-indexing comparisons. + +## The scoring formula + +The base score, computed after the trace's segment has settled ($t_{\text{final}}$), is: + +$$S_{\text{trace}}(t_{\text{final}}) = w_d \cdot \log_{10}\left( \frac{D_{\text{total}}}{D_{\text{threshold}}} + 1 \right) + w_e \cdot \mathbb{I}(\text{has\_error}) + \sum_{i} w_{\text{tag}, i} \cdot \mathbb{I}(\text{Tag}_i = V_i)$$ + +Where: + +- $D_{\text{total}}$: The total duration (latency) of the trace — timestamp difference between the earliest span start and the latest span end in the grouped trace block, clamped to $\max(0, D_{\text{total}})$ so cross-host clock skew cannot yield a negative value or an undefined logarithm. With the clamp, $D_{\text{total}}/D_{\text{threshold}} + 1 \ge 1$, so the duration term is always $\ge 0$. +- $D_{\text{threshold}}$: The base duration threshold (e.g. $500\text{ ms}$). +- $w_d$: Duration weight (`duration_weight`). +- $w_e$: Error weight (`error_weight`). +- $\mathbb{I}(\text{has\_error})$: $1$ if at least one span has its error state set, else $0$. +- $w_{\text{tag}, i}$: Weight of the $i$-th target tag (`tag_weights`). +- $\mathbb{I}(\text{Tag}_i = V_i)$: $1$ if any span in the trace carries that key/value pair, else $0$. + +## Score ranges and boundaries + +- **Lower bound:** $S_{\text{trace}} = 0.0$, reached when the clamped duration is $0$, no span carries an error, and no tag matches. A zero- or negative-duration trace lands on this $0$ floor. +- **Upper bound:** Theoretically $[0.0, \infty)$; in practice $[0.0, 30.0]$ under typical production weights given normal request-timeout ceilings (≤ 60 s). + +### Worked example (real `sw_trace` data) + +Using the original Scenario 7.1 weights — $D_{\text{threshold}} = 500\text{ ms}$, $w_d = 2.0$, $w_e = 12.0$: + +- **Slowest healthy trace** — `trace_id b03bb932-…-63a35cb317f7`, `/homepage` on `agent::ui` → `agent::frontend`, $D_{\text{total}} = 2802\text{ ms}$, no error span: + +$$S_{\text{trace}} = 2.0 \cdot \log_{10}\left( \frac{2802}{500} + 1 \right) + 12.0 \cdot (0) = 2.0 \cdot \log_{10}(6.604) \approx 1.64$$ + +- **Error trace** — `trace_id 5fcdb353-…-ec8cb4a1c88a`, `POST /test` on `agent::app`, $D_{\text{total}} = 4\text{ ms}$, `is_error = 1`: + +$$S_{\text{trace}} = 2.0 \cdot \log_{10}\left( \frac{4}{500} + 1 \right) + 12.0 \cdot (1) \approx 12.01$$ + +The error term dominates: even a $4\text{ ms}$ failed request outscores a genuinely slow ($2.8\text{ s}$) but healthy page load. + +## Stage-stepped scoring thresholds + +The model is **no continuous decay**: a trace's score is computed once at finalization and never decays. "Aging" is expressed structurally — each stage sets its own `retention_score_threshold`, and the bar rises as data moves to colder stages. A typical profile rises Hot → Warm → Cold (e.g. `1.0` → `3.0` → `10.0`): a high-value trace clears every bar and reaches Cold; a low-value trace is dropped at the first stage whose threshold it fails. One configured number per stage, no power-law constants, no per-trace recomputation. Time enters only through *when* a partition reaches each stage, governed by `LifecycleStage.ttl` / `segment_interval`. + +## Score persistence and pipeline revision + +"Computed once and frozen" is only safe if both the frozen value and the configuration that produced it are physically persisted alongside each trace. + +**Per-trace post-trace metadata stream.** The finalization pass writes one record per surviving trace into a per-trace metadata stream inside the part, parallel to the existing `spans`, `tags`, and `primary` streams. Each record is keyed by `trace_id` and is fully self-contained: + +- `score` (`float64`) — the frozen $S_{\text{trace}}(t_{\text{final}})$ value. +- `pipeline_revision` (`int64`) — the value of `TracePipelineConfig.metadata.modRevision` of the pipeline that produced the score (audit/debug). +- `stage_thresholds` (`map<string, float64>`) — the resolved per-stage `retention_score_threshold` map at that revision (e.g. `{hot: 1.0, warm: 3.0, cold: 10.0}`). Pinning the thresholds — not just the revision id — keeps each trace's retention destiny self-describing, with no dependency on retained pipeline-config history at comparison time. + +Engine-managed (not a user-declared tag), so no `Trace` schema change is required. Overhead is on the order of tens of bytes per trace. + +**Rewrites preserve all three values.** Compaction merges and the pre-migration rewrite copy per-trace columns by `trace_id` via the existing `blockReader` / `blockWriter` machinery; the post-trace metadata stream is carried alongside `spans` / `tags` / `primary`. Subsequent rewrites never recompute `score` from spans on a finalized trace. + +**Read paths.** Pre-migration is the primary consumer: by the time a segment migrates it is past its stage TTL ≫ `finalize_grace`, so every surviving trace already has a record. The compaction filter consults the record when present; for the brief window in which a trace is mature (per `merge_grace`) but its segment has not yet been finalized, no record exists yet — the filter computes `score` on the fly and uses the current pipeline's threshold (this matches what the later finalization will pin). + +**Config-edit semantics: forward-looking, never retroactive.** Editing `TracePipelineConfig` advances `metadata.modRevision`; traces finalized under an earlier revision keep their pinned `score`, `pipeline_revision`, and `stage_thresholds`. Retroactive re-evaluation requires an explicit operator action. + +## Proto schema (for revival) + +If revived, scoring would reintroduce these messages to `api/proto/banyandb/pipeline/v1/trace_pipeline.proto`: + +```Plain Text +message ScoringRules { + double duration_weight = 1 [(validate.rules).double = {gte: 0.0}]; + double error_weight = 2 [(validate.rules).double = {gte: 0.0}]; + repeated TagWeight tag_weights = 3; +} + +message TagWeight { + string key = 1; + string value = 2; + double weight = 3 [(validate.rules).double = {gte: 0.0}]; +} +``` + +And `StageRule` would carry a per-stage threshold (field 2 is reserved in the current proto exactly so this revival is wire-safe): + +```Plain Text +double retention_score_threshold = 2 [(validate.rules).double = {gte: 0.0}]; +``` + +Plus a `ScoringRules scoring_rules = 7;` field on `TracePipelineConfig` (field 7 is reserved in the current proto for the same reason). + Review Comment: This section claims `StageRule` field 2 and `TracePipelineConfig` field 7 are reserved for scoring revival, but in the current proto `StageRule` field 2 is `apply_at` and `TracePipelineConfig` field 7 is `merge_grace`. As written, this doc would mislead future implementers about wire-compatibility constraints. ########## docs/design/post-trace-pipeline.md: ########## @@ -0,0 +1,493 @@ +# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping Specification + +This document presents the complete technical design for the **BanyanDB Storage-Node Post-Trace Pipeline** and its integration with physical hardware storage tiers and time-aging schedulers. + +Unlike traditional streaming-based telemetry engines that perform span assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a **native trace model**. Spans are stored, sorted, and indexed by `trace_id` directly within the storage engine layout on individual data nodes. + +By decoupling trace assembly from the active ingestion path, the post-trace pipeline executes asynchronous tail-sampling, per-stage retention filtering, and data reduction natively during storage lifecycle events on the data nodes. This design is grounded in recent academic advancements in post-hoc retroactive tracing and storage-level file merge analysis. + +## Architectural Concept: Decoupled Gating and Per-Stage Retention + +To maximize storage efficiency and compute performance on the BanyanDB data nodes, trace evaluation is separated into two logical phases, which subsequently feed into an automated **time-aging system** for partition-level migration: + +```Plain Text ++-----------------------------------+ + | Grouped Trace Assembly | + +-----------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 1. Tail-Sampling Gating (tail_sampling, at finalization) | +| - Goal: Decide whether a trace is retained at all | +| - Criteria: sure-keep rules (duration / errors / tags) + sampling | ++-----------------------------------------------------------------------+ + | + +-----------------+-----------------+ + | (Retain) | (Drop / Purge) + v v ++---------------------------------------------------+ +-----------------+ +| 2. Per-Stage Retention (StageRule predicates) | | Discard Block | +| - Goal: At each stage's lifecycle event, | | Reclaim Space | +| keep / drop per the stage's predicates | +-----------------+ +| - Criteria: min_duration / keep_errors / | +| keep_tag_rules, per stage | ++---------------------------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 3. Time-Aging System (Stage-Stepped Migration Engine) | +| - Goal: Each stage's StageRule governs survival at its lifecycle | +| events (compaction, finalization, migration_out) | +| - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order) | +| - RULE: medium is a node-group concern (may be heterogeneous); | +| predicates govern per-stage retention, not medium selection. | ++-----------------------------------------------------------------------+ +``` + +### 1.1 Gating (`tail_sampling`) + +- **Operation Type**: Tail-sampling filter (keep/drop decision at finalization). It combines absolute *sure-keep* rules with a hashed probabilistic sample of everything else; it is not a pure boolean classifier. + +- **Responsibility**: Initial filtering at the finalization pass (§8.3). It determines whether an assembled trace block **survives** at all — passing into the stage lifecycle — or is **purged** immediately to reclaim storage space on disk. + +- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules (duration boundaries, error presence, tag match) plus a single `hash(trace_id) < healthy_sample_rate` comparison. The probabilistic keep is a deterministic `hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / drop decision is stable only once the trace has stopped growing: a partial trace's duration is a lower bound, so any drop is deferred until the trace's last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) then yields the same keep/drop result. + +### 1.2 Per-Stage Retention (StageRule predicates) + +- **Operation Type**: Per-stage keep/drop predicate evaluation. Each `StageRule` declares which trace properties (`min_duration`, `keep_errors`, `keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails every predicate is dropped at the stage's next configured lifecycle event. + +- **Responsibility**: Decides **which traces survive each stage** as data ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed by increasingly strict predicates at each stage's `MIGRATION_OUT` (or `COMPACTION`) event. The pipeline does **not** choose the physical storage medium; that is a node-group / `LifecycleStage` placement concern (§4.1). + +- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata (free, no decode); `has_error` requires a decoded read of span bodies only when `keep_errors` is set; tag matchers run against indexed tags. Predicates are direct boolean checks. + +## Protobuf Message Design + +The pipeline configuration is a single trace-typed message, `TracePipelineConfig`. Rather than a parallel abstract metadata layer, it reuses the existing catalog identifiers — Group (via `metadata`), lifecycle stage names, and schema names — and adds only the trace-specific tail-sampling and per-stage retention rules. General stream parsing and metrics aggregation configurations are excluded to maintain a strict focus on trace-centric analytical operations. + +### 2.1 Targeting Model: Reuse Existing Catalog Identifiers + +This design deliberately does **not** introduce an abstract `Pipeline` resource or an `ExecutionTrigger` enum. Both would re-declare targeting that the storage model already owns and let the two drift. A trace pipeline is instead expressed entirely with identifiers that already exist: + +- **Group** — named by the config's own `metadata.group` (`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that Group, exactly as every other schema resource does. The Group already fixes the `catalog`, so catalog is never repeated. +- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, `keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules are configured here, per stage and per pipeline — there are no hardcoded predicates in the engine. +- **Schema selector** — an explicit `schema_names` list (exact match on `common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches if it is listed OR matches the regex; both empty targets every schema in the Group. + +The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced by **stage lifecycle events**, because each trigger was already anchored to a stage: compaction is an LSM merge within a stage (§8.1), migration is a segment leaving one stage for the next (§8.2), and the scheduled pass finalizes a settled time segment within a stage (§8.3). Each `StageRule` carries its own `apply_at` selecting which of those events run the filter; empty means all of them. This keeps the trace-specific rules out of the generic, catalog-agnostic `common.v1.LifecycleStage` (which stream and measure groups share) while still binding them to stages by name. + +Multiple `TracePipelineConfig`s may coexist in a Group only when their effective coverage is disjoint; the schema registry enforces this with a per-tuple uniqueness rule (§2.3). + +### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`) + +The full message definitions live in the proto source at `api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package `banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is the single root resource: it carries its own identity (`metadata`), the targeting fields from §2.1, and the trace-specific tail-sampling and per-stage retention rules. There is no embedded abstract `Pipeline` and no `ExecutionTrigger`; the lifecycle events that run the filter are named by `apply_at`. + +The message set is: + +- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, the `tail_sampling` (gating) block, and the completeness windows `merge_grace` (§8.1) and `finalize_grace` (§8.3). +- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, `min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are OR-combined: a trace survives if any one matches; it is dropped if it fails all set predicates. A rule with no predicates set at all has no filtering effect (see §4.2). +- **`StageEvent`** — enum of the lifecycle events a rule may run at: `STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and `STAGE_EVENT_MIGRATION_OUT` (§8.2). +- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: `duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` (each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`). +- **`TagMatcher`** — sure-keep tag predicate used by `StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries no `sample_rate` — a match is an absolute keep. + +### 2.3 Uniqueness and Conflict Policy + +Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` tuple**. The schema registry enforces this on write so the filter behavior at any tuple is deterministic — there is no implicit ordering, priority, or composition of overlapping pipelines. + +**Effective coverage of a pipeline.** A `TracePipelineConfig` P with `enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when: + +- `schema` is a schema in `P.metadata.group` whose name is listed in `P.schema_names` or matches `P.schema_name_regex` (both empty selectors target every schema in the Group); +- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every stage in the Group's `ResourceOpts`); +- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers every `StageEvent`). + +**Conflict detection on write.** When a `TracePipelineConfig` is created or updated, the registry expands its effective coverage against the current schemas and stages of its Group and rejects the write if **any** covered tuple is already covered by another `enabled` pipeline in the same Group. The error names the conflicting pipeline and the overlapping tuple(s). + +**Schema additions.** When a new `Trace` schema is created in a Group, the registry re-validates every `enabled` `TracePipelineConfig` in that Group against the new schema. If the new schema would cause two pipelines to overlap on any `(stage, event)`, the schema creation is rejected; the operator must narrow the pipelines (e.g. add explicit `schema_names`) before adding the schema. + +**Disabling and replacement.** Setting `enabled=false` removes a pipeline from active coverage immediately, freeing its tuples for another pipeline to claim. This is the safe path for atomically swapping in a new pipeline: disable the old, enable or create the new, then delete the old. There is no implicit precedence between two `enabled` pipelines — the registry will not pick a winner. + +**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and migration (§8.2) the engine selects the active pipeline by `(group, schema, stage, event)`. If a registry inconsistency ever exposes more than one active match for a tuple, the engine logs an error and applies **none** of them — failing safe to retain rather than risk a nondeterministic destructive drop. + +**Recommended pattern.** For most workloads, a single pipeline per Group with a broad `schema_name_regex` is the simplest configuration. Multiple pipelines in a Group are useful only when their schema selectors are mutually exclusive (for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in a hypothetical mixed group) so no tuple is doubly covered. + +### 2.4 Admission Validation + +The proto pins bounds with PGV so a malformed config is rejected at parse time rather than producing undefined behavior. The schema registry's admission control adds one cross-field rule that PGV cannot express. + +**PGV-enforced bounds** (rejected at proto parse / `Write` time): + +- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 0s`). +- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in `[0.0, 1.0]`. +- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset means duration is not a retention factor for that stage. +- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — strictly positive when set; unset means engine default (§8.1 / §8.3). + +**Admission rule (server-side; not expressible in PGV):** + +- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = true`, the config must declare at least one of: a non-empty `tail_sampling`, or at least one `StageRule` with at least one retention predicate (`min_duration`, `keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with no gating and no per-stage predicates has no effect on traces and is rejected as a configuration error. To temporarily disable a pipeline, set `enabled = false` — do not strip its rule blocks. + +Both the registry write path and the runtime apply these checks. If a registry inconsistency ever lets a malformed config reach the engine, the engine logs an error and falls back to **retain** for every tuple the malformed config would have governed — never a destructive drop (consistent with the §2.3 runtime fail-safe). + +## Retention-Decision Timing & Trace Completeness + +Both the tail-sampling gate (§1.1) and the per-stage retention predicates (§1.2) operate on the **whole** trace, not on individual spans as they arrive: `D_total` needs the earliest start and latest end, `has_error` scans all spans, and tag matchers scan all spans. Spans of one trace, however, arrive at the storage node anywhere from milliseconds to hours apart, and BanyanDB writes each span into the segment selected by its **event time** (the `start_time` tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no live partial-trace buffer; a trace exists only as spans co-located by `trace_id` inside a segment's parts. + +### 3.1 When Retention Decisions Are Safe + +Predicates are not evaluated for destructive drops at write time. They run lazily, by the post-trace passes that re-assemble a trace by `trace_id`: + +- **Provisional** — during an LSM compaction merge (§8.1). This sees only the spans that have landed so far, so a drop verdict could be premature; the per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace has stopped growing. +- **Authoritative** — at the post-trace scheduler's finalization pass once the trace's segment has **settled**: its event-time window has closed *and* the event-time watermark has advanced past it by the `finalize_grace` period (§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates the next segment up to an hour *before* the current window ends, and late data keeps routing to an old segment by event time — so "settled" is a watermark heuristic, not a guarantee. +- **At migration_out** — by the time a segment is migrated to the next stage (≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every trace it holds has been settled, so predicate evaluation is final. + +Two completeness caveats follow from event-time segmentation: + +1. **Spans arriving after finalization.** A span whose `start_time` falls in an already-finalized segment is still accepted (while the segment is within retention) but is missed by the finalization-time evaluation. The `finalize_grace` period bounds how often this happens; a late span that lands after it is only re-evaluated if a subsequent compaction rewrites that part. + +2. **Traces spanning multiple segments.** A trace longer than `segment_interval`, or one that straddles a boundary, is physically split across two segments, and each segment's pass evaluates only its own fragment (`D_total` and the indicators are fragment-local). BanyanDB performs no cross-segment trace assembly at the storage layer. Where trace duration is far below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day segments) the fragment equals the whole trace and this is a non-issue; for long-running traces it is a known limitation. + +## Integration of the Retention System & Time-Aging System + +Distributed telemetry's diagnostic utility naturally decays as time passes. The pipeline must therefore decide, per stage, what to keep — but it must **not** try to pick the physical storage medium, because in BanyanDB the medium is already determined by where a `LifecycleStage` is placed. + +### 4.1 Stage Placement vs. Rule-Governed Retention + +The physical medium is **not** fixed per tier by this design. Each `LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through its `node_selector`; the medium is whatever hardware that node group runs, and operators may deploy heterogeneous backings (for example two warm node groups on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA SSD/HDD, Cold → object storage, but that mapping is an operator deployment choice, not a rule this pipeline enforces. + +Because medium selection already belongs to stage placement, per-stage `StageRule` predicates **do not route traces to a medium**. Within whatever stage currently holds a trace, the predicates only act as a **retention gate** — dictating whether a trace is retained when a partition is rewritten or migrated, or discarded/pruned to reduce storage footprint. + +### 4.2 Stage-Stepped Retention via Rising Predicates + +"Aging" is expressed structurally: each `StageRule` declares its own predicates, and the bar rises as data moves to colder stages. **Semantics of a `StageRule` at one of its `apply_at` events:** the set predicates are OR-combined — a trace is retained if **any** set predicate matches it, and dropped if it fails all of them. A `StageRule` with no predicates configured at all has no filtering effect (every trace at that stage is retained); to actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, or `keep_tag_rules` must be set. + +A typical profile rises Hot → Warm → Cold by tightening predicates at each stage's `MIGRATION_OUT` event. For example: + +- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag (PostgreSQL, ActiveMQ, etc.). +- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the most-important tag. +- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces are dropped, only errors persist for the full 30-day retention. + +Predicates are direct boolean checks, so retention is deterministic and self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` matched, even though it was only 3 ms." Time enters only through *when* a partition reaches each stage, which is governed by `LifecycleStage.ttl` / `segment_interval`. + +## Downstream Lifecycle Actions Governed by Per-Stage Predicates + +The time-aging engine evaluates each trace against its current stage's `StageRule` predicates at the stage's configured lifecycle events, filtering data blocks as partitions are rewritten or migrated between lifecycle stages. + +### 6.1 Compaction Rewrites Review Comment: Section numbering jumps from §4.2 to §6.1, which makes cross-references harder to follow and looks unintentional. ########## docs/design/post-trace-pipeline.md: ########## @@ -0,0 +1,493 @@ +# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping Specification + +This document presents the complete technical design for the **BanyanDB Storage-Node Post-Trace Pipeline** and its integration with physical hardware storage tiers and time-aging schedulers. + +Unlike traditional streaming-based telemetry engines that perform span assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a **native trace model**. Spans are stored, sorted, and indexed by `trace_id` directly within the storage engine layout on individual data nodes. + +By decoupling trace assembly from the active ingestion path, the post-trace pipeline executes asynchronous tail-sampling, per-stage retention filtering, and data reduction natively during storage lifecycle events on the data nodes. This design is grounded in recent academic advancements in post-hoc retroactive tracing and storage-level file merge analysis. + +## Architectural Concept: Decoupled Gating and Per-Stage Retention + +To maximize storage efficiency and compute performance on the BanyanDB data nodes, trace evaluation is separated into two logical phases, which subsequently feed into an automated **time-aging system** for partition-level migration: + +```Plain Text ++-----------------------------------+ + | Grouped Trace Assembly | + +-----------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 1. Tail-Sampling Gating (tail_sampling, at finalization) | +| - Goal: Decide whether a trace is retained at all | +| - Criteria: sure-keep rules (duration / errors / tags) + sampling | ++-----------------------------------------------------------------------+ + | + +-----------------+-----------------+ + | (Retain) | (Drop / Purge) + v v ++---------------------------------------------------+ +-----------------+ +| 2. Per-Stage Retention (StageRule predicates) | | Discard Block | +| - Goal: At each stage's lifecycle event, | | Reclaim Space | +| keep / drop per the stage's predicates | +-----------------+ +| - Criteria: min_duration / keep_errors / | +| keep_tag_rules, per stage | ++---------------------------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 3. Time-Aging System (Stage-Stepped Migration Engine) | +| - Goal: Each stage's StageRule governs survival at its lifecycle | +| events (compaction, finalization, migration_out) | +| - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order) | +| - RULE: medium is a node-group concern (may be heterogeneous); | +| predicates govern per-stage retention, not medium selection. | ++-----------------------------------------------------------------------+ +``` + +### 1.1 Gating (`tail_sampling`) + +- **Operation Type**: Tail-sampling filter (keep/drop decision at finalization). It combines absolute *sure-keep* rules with a hashed probabilistic sample of everything else; it is not a pure boolean classifier. + +- **Responsibility**: Initial filtering at the finalization pass (§8.3). It determines whether an assembled trace block **survives** at all — passing into the stage lifecycle — or is **purged** immediately to reclaim storage space on disk. + +- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules (duration boundaries, error presence, tag match) plus a single `hash(trace_id) < healthy_sample_rate` comparison. The probabilistic keep is a deterministic `hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / drop decision is stable only once the trace has stopped growing: a partial trace's duration is a lower bound, so any drop is deferred until the trace's last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) then yields the same keep/drop result. + +### 1.2 Per-Stage Retention (StageRule predicates) + +- **Operation Type**: Per-stage keep/drop predicate evaluation. Each `StageRule` declares which trace properties (`min_duration`, `keep_errors`, `keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails every predicate is dropped at the stage's next configured lifecycle event. + +- **Responsibility**: Decides **which traces survive each stage** as data ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed by increasingly strict predicates at each stage's `MIGRATION_OUT` (or `COMPACTION`) event. The pipeline does **not** choose the physical storage medium; that is a node-group / `LifecycleStage` placement concern (§4.1). + +- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata (free, no decode); `has_error` requires a decoded read of span bodies only when `keep_errors` is set; tag matchers run against indexed tags. Predicates are direct boolean checks. + +## Protobuf Message Design + +The pipeline configuration is a single trace-typed message, `TracePipelineConfig`. Rather than a parallel abstract metadata layer, it reuses the existing catalog identifiers — Group (via `metadata`), lifecycle stage names, and schema names — and adds only the trace-specific tail-sampling and per-stage retention rules. General stream parsing and metrics aggregation configurations are excluded to maintain a strict focus on trace-centric analytical operations. + +### 2.1 Targeting Model: Reuse Existing Catalog Identifiers + +This design deliberately does **not** introduce an abstract `Pipeline` resource or an `ExecutionTrigger` enum. Both would re-declare targeting that the storage model already owns and let the two drift. A trace pipeline is instead expressed entirely with identifiers that already exist: + +- **Group** — named by the config's own `metadata.group` (`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that Group, exactly as every other schema resource does. The Group already fixes the `catalog`, so catalog is never repeated. +- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, `keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules are configured here, per stage and per pipeline — there are no hardcoded predicates in the engine. +- **Schema selector** — an explicit `schema_names` list (exact match on `common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches if it is listed OR matches the regex; both empty targets every schema in the Group. + +The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced by **stage lifecycle events**, because each trigger was already anchored to a stage: compaction is an LSM merge within a stage (§8.1), migration is a segment leaving one stage for the next (§8.2), and the scheduled pass finalizes a settled time segment within a stage (§8.3). Each `StageRule` carries its own `apply_at` selecting which of those events run the filter; empty means all of them. This keeps the trace-specific rules out of the generic, catalog-agnostic `common.v1.LifecycleStage` (which stream and measure groups share) while still binding them to stages by name. + +Multiple `TracePipelineConfig`s may coexist in a Group only when their effective coverage is disjoint; the schema registry enforces this with a per-tuple uniqueness rule (§2.3). + +### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`) + +The full message definitions live in the proto source at `api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package `banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is the single root resource: it carries its own identity (`metadata`), the targeting fields from §2.1, and the trace-specific tail-sampling and per-stage retention rules. There is no embedded abstract `Pipeline` and no `ExecutionTrigger`; the lifecycle events that run the filter are named by `apply_at`. + +The message set is: + +- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, the `tail_sampling` (gating) block, and the completeness windows `merge_grace` (§8.1) and `finalize_grace` (§8.3). +- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, `min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are OR-combined: a trace survives if any one matches; it is dropped if it fails all set predicates. A rule with no predicates set at all has no filtering effect (see §4.2). +- **`StageEvent`** — enum of the lifecycle events a rule may run at: `STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and `STAGE_EVENT_MIGRATION_OUT` (§8.2). +- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: `duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` (each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`). +- **`TagMatcher`** — sure-keep tag predicate used by `StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries no `sample_rate` — a match is an absolute keep. + +### 2.3 Uniqueness and Conflict Policy + +Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` tuple**. The schema registry enforces this on write so the filter behavior at any tuple is deterministic — there is no implicit ordering, priority, or composition of overlapping pipelines. + +**Effective coverage of a pipeline.** A `TracePipelineConfig` P with `enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when: + +- `schema` is a schema in `P.metadata.group` whose name is listed in `P.schema_names` or matches `P.schema_name_regex` (both empty selectors target every schema in the Group); +- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every stage in the Group's `ResourceOpts`); +- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers every `StageEvent`). + +**Conflict detection on write.** When a `TracePipelineConfig` is created or updated, the registry expands its effective coverage against the current schemas and stages of its Group and rejects the write if **any** covered tuple is already covered by another `enabled` pipeline in the same Group. The error names the conflicting pipeline and the overlapping tuple(s). + +**Schema additions.** When a new `Trace` schema is created in a Group, the registry re-validates every `enabled` `TracePipelineConfig` in that Group against the new schema. If the new schema would cause two pipelines to overlap on any `(stage, event)`, the schema creation is rejected; the operator must narrow the pipelines (e.g. add explicit `schema_names`) before adding the schema. + +**Disabling and replacement.** Setting `enabled=false` removes a pipeline from active coverage immediately, freeing its tuples for another pipeline to claim. This is the safe path for atomically swapping in a new pipeline: disable the old, enable or create the new, then delete the old. There is no implicit precedence between two `enabled` pipelines — the registry will not pick a winner. + +**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and migration (§8.2) the engine selects the active pipeline by `(group, schema, stage, event)`. If a registry inconsistency ever exposes more than one active match for a tuple, the engine logs an error and applies **none** of them — failing safe to retain rather than risk a nondeterministic destructive drop. + +**Recommended pattern.** For most workloads, a single pipeline per Group with a broad `schema_name_regex` is the simplest configuration. Multiple pipelines in a Group are useful only when their schema selectors are mutually exclusive (for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in a hypothetical mixed group) so no tuple is doubly covered. + +### 2.4 Admission Validation + +The proto pins bounds with PGV so a malformed config is rejected at parse time rather than producing undefined behavior. The schema registry's admission control adds one cross-field rule that PGV cannot express. + +**PGV-enforced bounds** (rejected at proto parse / `Write` time): + +- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 0s`). +- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in `[0.0, 1.0]`. +- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset means duration is not a retention factor for that stage. +- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — strictly positive when set; unset means engine default (§8.1 / §8.3). + +**Admission rule (server-side; not expressible in PGV):** + +- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = true`, the config must declare at least one of: a non-empty `tail_sampling`, or at least one `StageRule` with at least one retention predicate (`min_duration`, `keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with no gating and no per-stage predicates has no effect on traces and is rejected as a configuration error. To temporarily disable a pipeline, set `enabled = false` — do not strip its rule blocks. + +Both the registry write path and the runtime apply these checks. If a registry inconsistency ever lets a malformed config reach the engine, the engine logs an error and falls back to **retain** for every tuple the malformed config would have governed — never a destructive drop (consistent with the §2.3 runtime fail-safe). + +## Retention-Decision Timing & Trace Completeness + +Both the tail-sampling gate (§1.1) and the per-stage retention predicates (§1.2) operate on the **whole** trace, not on individual spans as they arrive: `D_total` needs the earliest start and latest end, `has_error` scans all spans, and tag matchers scan all spans. Spans of one trace, however, arrive at the storage node anywhere from milliseconds to hours apart, and BanyanDB writes each span into the segment selected by its **event time** (the `start_time` tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no live partial-trace buffer; a trace exists only as spans co-located by `trace_id` inside a segment's parts. + +### 3.1 When Retention Decisions Are Safe + +Predicates are not evaluated for destructive drops at write time. They run lazily, by the post-trace passes that re-assemble a trace by `trace_id`: + +- **Provisional** — during an LSM compaction merge (§8.1). This sees only the spans that have landed so far, so a drop verdict could be premature; the per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace has stopped growing. +- **Authoritative** — at the post-trace scheduler's finalization pass once the trace's segment has **settled**: its event-time window has closed *and* the event-time watermark has advanced past it by the `finalize_grace` period (§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates the next segment up to an hour *before* the current window ends, and late data keeps routing to an old segment by event time — so "settled" is a watermark heuristic, not a guarantee. +- **At migration_out** — by the time a segment is migrated to the next stage (≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every trace it holds has been settled, so predicate evaluation is final. + +Two completeness caveats follow from event-time segmentation: + +1. **Spans arriving after finalization.** A span whose `start_time` falls in an already-finalized segment is still accepted (while the segment is within retention) but is missed by the finalization-time evaluation. The `finalize_grace` period bounds how often this happens; a late span that lands after it is only re-evaluated if a subsequent compaction rewrites that part. + +2. **Traces spanning multiple segments.** A trace longer than `segment_interval`, or one that straddles a boundary, is physically split across two segments, and each segment's pass evaluates only its own fragment (`D_total` and the indicators are fragment-local). BanyanDB performs no cross-segment trace assembly at the storage layer. Where trace duration is far below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day segments) the fragment equals the whole trace and this is a non-issue; for long-running traces it is a known limitation. + +## Integration of the Retention System & Time-Aging System + +Distributed telemetry's diagnostic utility naturally decays as time passes. The pipeline must therefore decide, per stage, what to keep — but it must **not** try to pick the physical storage medium, because in BanyanDB the medium is already determined by where a `LifecycleStage` is placed. + +### 4.1 Stage Placement vs. Rule-Governed Retention + +The physical medium is **not** fixed per tier by this design. Each `LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through its `node_selector`; the medium is whatever hardware that node group runs, and operators may deploy heterogeneous backings (for example two warm node groups on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA SSD/HDD, Cold → object storage, but that mapping is an operator deployment choice, not a rule this pipeline enforces. + +Because medium selection already belongs to stage placement, per-stage `StageRule` predicates **do not route traces to a medium**. Within whatever stage currently holds a trace, the predicates only act as a **retention gate** — dictating whether a trace is retained when a partition is rewritten or migrated, or discarded/pruned to reduce storage footprint. + +### 4.2 Stage-Stepped Retention via Rising Predicates + +"Aging" is expressed structurally: each `StageRule` declares its own predicates, and the bar rises as data moves to colder stages. **Semantics of a `StageRule` at one of its `apply_at` events:** the set predicates are OR-combined — a trace is retained if **any** set predicate matches it, and dropped if it fails all of them. A `StageRule` with no predicates configured at all has no filtering effect (every trace at that stage is retained); to actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, or `keep_tag_rules` must be set. + +A typical profile rises Hot → Warm → Cold by tightening predicates at each stage's `MIGRATION_OUT` event. For example: + +- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag (PostgreSQL, ActiveMQ, etc.). +- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the most-important tag. +- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces are dropped, only errors persist for the full 30-day retention. + +Predicates are direct boolean checks, so retention is deterministic and self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` matched, even though it was only 3 ms." Time enters only through *when* a partition reaches each stage, which is governed by `LifecycleStage.ttl` / `segment_interval`. + +## Downstream Lifecycle Actions Governed by Per-Stage Predicates + +The time-aging engine evaluates each trace against its current stage's `StageRule` predicates at the stage's configured lifecycle events, filtering data blocks as partitions are rewritten or migrated between lifecycle stages. + +### 6.1 Compaction Rewrites + +During normal LSM merge operations inside a stage, the data node compacts older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its `apply_at`, the merge filter (§8.1) evaluates each mature trace against the stage's predicates and omits non-matching traces from the consolidated output. Compaction is then both a space-reclamation pass and a per-stage retention pass; pipelines that target only migration boundaries leave the routine LSM compaction byte-for-byte lossless. + +### 6.2 Partition-Level Tier Migration + +Both showcase trace groups (`sw_trace`, `sw_zipkinTrace`) declare a Hot → Warm → Cold lifecycle in `ResourceOpts.stages`: Hot holds the active window (1-day ttl), Warm runs on `node_selector type=warm` (7-day ttl), and Cold runs on `node_selector type=cold` with `close=true` (30-day ttl). When a Hot segment matures past its 1-day ttl, the migration worker copies it to the Warm node group; the medium behind each node group (e.g. NVMe → SATA SSD → object storage) is the operator's deployment choice (§4.1), not something this pipeline picks. + +- The migration engine evaluates the source stage's `MIGRATION_OUT` predicates against every trace in the partition. + +- **No dynamic splitting is performed:** all retained data is written to the next stage's node group. Traces that fail the source stage's `MIGRATION_OUT` predicates are omitted from the target write stream, reducing the physical size of the migrated partition. With Scenario 7.1's Hot predicates a healthy `/homepage` trace (2802 ms) matches `min_duration: 100ms` and migrates to Warm; a PostgreSQL-touching trace matches `keep_tag_rules` and migrates too; a healthy fast trace (6 ms) matches none and is dropped. + +- When the partition matures past the Warm 7-day ttl, the Warm stage's `MIGRATION_OUT` predicates apply (typically stricter — e.g. `min_duration: 500ms`); only matching traces are written into the Cold parts. Healthy baseline traces, even ones that survived Warm, are dropped here; error traces (with `keep_errors: true`) persist all the way to Cold. + +### 6.3 Eviction: Per-Trace Drop During Rewrites, Segment-Granularity GC Review Comment: Section numbering jumps from §4.2 to §6.3; renumbering to §5.3 would keep the document's subsection numbering contiguous. ########## docs/design/post-trace-pipeline.md: ########## @@ -0,0 +1,493 @@ +# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping Specification + +This document presents the complete technical design for the **BanyanDB Storage-Node Post-Trace Pipeline** and its integration with physical hardware storage tiers and time-aging schedulers. + +Unlike traditional streaming-based telemetry engines that perform span assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a **native trace model**. Spans are stored, sorted, and indexed by `trace_id` directly within the storage engine layout on individual data nodes. + +By decoupling trace assembly from the active ingestion path, the post-trace pipeline executes asynchronous tail-sampling, per-stage retention filtering, and data reduction natively during storage lifecycle events on the data nodes. This design is grounded in recent academic advancements in post-hoc retroactive tracing and storage-level file merge analysis. + +## Architectural Concept: Decoupled Gating and Per-Stage Retention + +To maximize storage efficiency and compute performance on the BanyanDB data nodes, trace evaluation is separated into two logical phases, which subsequently feed into an automated **time-aging system** for partition-level migration: + +```Plain Text ++-----------------------------------+ + | Grouped Trace Assembly | + +-----------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 1. Tail-Sampling Gating (tail_sampling, at finalization) | +| - Goal: Decide whether a trace is retained at all | +| - Criteria: sure-keep rules (duration / errors / tags) + sampling | ++-----------------------------------------------------------------------+ + | + +-----------------+-----------------+ + | (Retain) | (Drop / Purge) + v v ++---------------------------------------------------+ +-----------------+ +| 2. Per-Stage Retention (StageRule predicates) | | Discard Block | +| - Goal: At each stage's lifecycle event, | | Reclaim Space | +| keep / drop per the stage's predicates | +-----------------+ +| - Criteria: min_duration / keep_errors / | +| keep_tag_rules, per stage | ++---------------------------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 3. Time-Aging System (Stage-Stepped Migration Engine) | +| - Goal: Each stage's StageRule governs survival at its lifecycle | +| events (compaction, finalization, migration_out) | +| - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order) | +| - RULE: medium is a node-group concern (may be heterogeneous); | +| predicates govern per-stage retention, not medium selection. | ++-----------------------------------------------------------------------+ +``` + +### 1.1 Gating (`tail_sampling`) + +- **Operation Type**: Tail-sampling filter (keep/drop decision at finalization). It combines absolute *sure-keep* rules with a hashed probabilistic sample of everything else; it is not a pure boolean classifier. + +- **Responsibility**: Initial filtering at the finalization pass (§8.3). It determines whether an assembled trace block **survives** at all — passing into the stage lifecycle — or is **purged** immediately to reclaim storage space on disk. + +- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules (duration boundaries, error presence, tag match) plus a single `hash(trace_id) < healthy_sample_rate` comparison. The probabilistic keep is a deterministic `hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / drop decision is stable only once the trace has stopped growing: a partial trace's duration is a lower bound, so any drop is deferred until the trace's last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) then yields the same keep/drop result. + +### 1.2 Per-Stage Retention (StageRule predicates) + +- **Operation Type**: Per-stage keep/drop predicate evaluation. Each `StageRule` declares which trace properties (`min_duration`, `keep_errors`, `keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails every predicate is dropped at the stage's next configured lifecycle event. + +- **Responsibility**: Decides **which traces survive each stage** as data ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed by increasingly strict predicates at each stage's `MIGRATION_OUT` (or `COMPACTION`) event. The pipeline does **not** choose the physical storage medium; that is a node-group / `LifecycleStage` placement concern (§4.1). + +- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata (free, no decode); `has_error` requires a decoded read of span bodies only when `keep_errors` is set; tag matchers run against indexed tags. Predicates are direct boolean checks. + +## Protobuf Message Design + +The pipeline configuration is a single trace-typed message, `TracePipelineConfig`. Rather than a parallel abstract metadata layer, it reuses the existing catalog identifiers — Group (via `metadata`), lifecycle stage names, and schema names — and adds only the trace-specific tail-sampling and per-stage retention rules. General stream parsing and metrics aggregation configurations are excluded to maintain a strict focus on trace-centric analytical operations. + +### 2.1 Targeting Model: Reuse Existing Catalog Identifiers + +This design deliberately does **not** introduce an abstract `Pipeline` resource or an `ExecutionTrigger` enum. Both would re-declare targeting that the storage model already owns and let the two drift. A trace pipeline is instead expressed entirely with identifiers that already exist: + +- **Group** — named by the config's own `metadata.group` (`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that Group, exactly as every other schema resource does. The Group already fixes the `catalog`, so catalog is never repeated. +- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, `keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules are configured here, per stage and per pipeline — there are no hardcoded predicates in the engine. +- **Schema selector** — an explicit `schema_names` list (exact match on `common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches if it is listed OR matches the regex; both empty targets every schema in the Group. + +The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced by **stage lifecycle events**, because each trigger was already anchored to a stage: compaction is an LSM merge within a stage (§8.1), migration is a segment leaving one stage for the next (§8.2), and the scheduled pass finalizes a settled time segment within a stage (§8.3). Each `StageRule` carries its own `apply_at` selecting which of those events run the filter; empty means all of them. This keeps the trace-specific rules out of the generic, catalog-agnostic `common.v1.LifecycleStage` (which stream and measure groups share) while still binding them to stages by name. + +Multiple `TracePipelineConfig`s may coexist in a Group only when their effective coverage is disjoint; the schema registry enforces this with a per-tuple uniqueness rule (§2.3). + +### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`) + +The full message definitions live in the proto source at `api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package `banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is the single root resource: it carries its own identity (`metadata`), the targeting fields from §2.1, and the trace-specific tail-sampling and per-stage retention rules. There is no embedded abstract `Pipeline` and no `ExecutionTrigger`; the lifecycle events that run the filter are named by `apply_at`. + +The message set is: + +- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, the `tail_sampling` (gating) block, and the completeness windows `merge_grace` (§8.1) and `finalize_grace` (§8.3). +- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, `min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are OR-combined: a trace survives if any one matches; it is dropped if it fails all set predicates. A rule with no predicates set at all has no filtering effect (see §4.2). +- **`StageEvent`** — enum of the lifecycle events a rule may run at: `STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and `STAGE_EVENT_MIGRATION_OUT` (§8.2). +- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: `duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` (each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`). +- **`TagMatcher`** — sure-keep tag predicate used by `StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries no `sample_rate` — a match is an absolute keep. + +### 2.3 Uniqueness and Conflict Policy + +Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` tuple**. The schema registry enforces this on write so the filter behavior at any tuple is deterministic — there is no implicit ordering, priority, or composition of overlapping pipelines. + +**Effective coverage of a pipeline.** A `TracePipelineConfig` P with `enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when: + +- `schema` is a schema in `P.metadata.group` whose name is listed in `P.schema_names` or matches `P.schema_name_regex` (both empty selectors target every schema in the Group); +- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every stage in the Group's `ResourceOpts`); +- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers every `StageEvent`). + +**Conflict detection on write.** When a `TracePipelineConfig` is created or updated, the registry expands its effective coverage against the current schemas and stages of its Group and rejects the write if **any** covered tuple is already covered by another `enabled` pipeline in the same Group. The error names the conflicting pipeline and the overlapping tuple(s). + +**Schema additions.** When a new `Trace` schema is created in a Group, the registry re-validates every `enabled` `TracePipelineConfig` in that Group against the new schema. If the new schema would cause two pipelines to overlap on any `(stage, event)`, the schema creation is rejected; the operator must narrow the pipelines (e.g. add explicit `schema_names`) before adding the schema. + +**Disabling and replacement.** Setting `enabled=false` removes a pipeline from active coverage immediately, freeing its tuples for another pipeline to claim. This is the safe path for atomically swapping in a new pipeline: disable the old, enable or create the new, then delete the old. There is no implicit precedence between two `enabled` pipelines — the registry will not pick a winner. + +**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and migration (§8.2) the engine selects the active pipeline by `(group, schema, stage, event)`. If a registry inconsistency ever exposes more than one active match for a tuple, the engine logs an error and applies **none** of them — failing safe to retain rather than risk a nondeterministic destructive drop. + +**Recommended pattern.** For most workloads, a single pipeline per Group with a broad `schema_name_regex` is the simplest configuration. Multiple pipelines in a Group are useful only when their schema selectors are mutually exclusive (for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in a hypothetical mixed group) so no tuple is doubly covered. + +### 2.4 Admission Validation + +The proto pins bounds with PGV so a malformed config is rejected at parse time rather than producing undefined behavior. The schema registry's admission control adds one cross-field rule that PGV cannot express. + +**PGV-enforced bounds** (rejected at proto parse / `Write` time): + +- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 0s`). +- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in `[0.0, 1.0]`. +- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset means duration is not a retention factor for that stage. +- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — strictly positive when set; unset means engine default (§8.1 / §8.3). + +**Admission rule (server-side; not expressible in PGV):** + +- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = true`, the config must declare at least one of: a non-empty `tail_sampling`, or at least one `StageRule` with at least one retention predicate (`min_duration`, `keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with no gating and no per-stage predicates has no effect on traces and is rejected as a configuration error. To temporarily disable a pipeline, set `enabled = false` — do not strip its rule blocks. + +Both the registry write path and the runtime apply these checks. If a registry inconsistency ever lets a malformed config reach the engine, the engine logs an error and falls back to **retain** for every tuple the malformed config would have governed — never a destructive drop (consistent with the §2.3 runtime fail-safe). + +## Retention-Decision Timing & Trace Completeness + +Both the tail-sampling gate (§1.1) and the per-stage retention predicates (§1.2) operate on the **whole** trace, not on individual spans as they arrive: `D_total` needs the earliest start and latest end, `has_error` scans all spans, and tag matchers scan all spans. Spans of one trace, however, arrive at the storage node anywhere from milliseconds to hours apart, and BanyanDB writes each span into the segment selected by its **event time** (the `start_time` tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no live partial-trace buffer; a trace exists only as spans co-located by `trace_id` inside a segment's parts. + +### 3.1 When Retention Decisions Are Safe + +Predicates are not evaluated for destructive drops at write time. They run lazily, by the post-trace passes that re-assemble a trace by `trace_id`: + +- **Provisional** — during an LSM compaction merge (§8.1). This sees only the spans that have landed so far, so a drop verdict could be premature; the per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace has stopped growing. +- **Authoritative** — at the post-trace scheduler's finalization pass once the trace's segment has **settled**: its event-time window has closed *and* the event-time watermark has advanced past it by the `finalize_grace` period (§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates the next segment up to an hour *before* the current window ends, and late data keeps routing to an old segment by event time — so "settled" is a watermark heuristic, not a guarantee. +- **At migration_out** — by the time a segment is migrated to the next stage (≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every trace it holds has been settled, so predicate evaluation is final. + +Two completeness caveats follow from event-time segmentation: + +1. **Spans arriving after finalization.** A span whose `start_time` falls in an already-finalized segment is still accepted (while the segment is within retention) but is missed by the finalization-time evaluation. The `finalize_grace` period bounds how often this happens; a late span that lands after it is only re-evaluated if a subsequent compaction rewrites that part. + +2. **Traces spanning multiple segments.** A trace longer than `segment_interval`, or one that straddles a boundary, is physically split across two segments, and each segment's pass evaluates only its own fragment (`D_total` and the indicators are fragment-local). BanyanDB performs no cross-segment trace assembly at the storage layer. Where trace duration is far below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day segments) the fragment equals the whole trace and this is a non-issue; for long-running traces it is a known limitation. + +## Integration of the Retention System & Time-Aging System + +Distributed telemetry's diagnostic utility naturally decays as time passes. The pipeline must therefore decide, per stage, what to keep — but it must **not** try to pick the physical storage medium, because in BanyanDB the medium is already determined by where a `LifecycleStage` is placed. + +### 4.1 Stage Placement vs. Rule-Governed Retention + +The physical medium is **not** fixed per tier by this design. Each `LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through its `node_selector`; the medium is whatever hardware that node group runs, and operators may deploy heterogeneous backings (for example two warm node groups on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA SSD/HDD, Cold → object storage, but that mapping is an operator deployment choice, not a rule this pipeline enforces. + +Because medium selection already belongs to stage placement, per-stage `StageRule` predicates **do not route traces to a medium**. Within whatever stage currently holds a trace, the predicates only act as a **retention gate** — dictating whether a trace is retained when a partition is rewritten or migrated, or discarded/pruned to reduce storage footprint. + +### 4.2 Stage-Stepped Retention via Rising Predicates + +"Aging" is expressed structurally: each `StageRule` declares its own predicates, and the bar rises as data moves to colder stages. **Semantics of a `StageRule` at one of its `apply_at` events:** the set predicates are OR-combined — a trace is retained if **any** set predicate matches it, and dropped if it fails all of them. A `StageRule` with no predicates configured at all has no filtering effect (every trace at that stage is retained); to actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, or `keep_tag_rules` must be set. + +A typical profile rises Hot → Warm → Cold by tightening predicates at each stage's `MIGRATION_OUT` event. For example: + +- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag (PostgreSQL, ActiveMQ, etc.). +- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the most-important tag. +- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces are dropped, only errors persist for the full 30-day retention. + +Predicates are direct boolean checks, so retention is deterministic and self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` matched, even though it was only 3 ms." Time enters only through *when* a partition reaches each stage, which is governed by `LifecycleStage.ttl` / `segment_interval`. + +## Downstream Lifecycle Actions Governed by Per-Stage Predicates + +The time-aging engine evaluates each trace against its current stage's `StageRule` predicates at the stage's configured lifecycle events, filtering data blocks as partitions are rewritten or migrated between lifecycle stages. + +### 6.1 Compaction Rewrites + +During normal LSM merge operations inside a stage, the data node compacts older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its `apply_at`, the merge filter (§8.1) evaluates each mature trace against the stage's predicates and omits non-matching traces from the consolidated output. Compaction is then both a space-reclamation pass and a per-stage retention pass; pipelines that target only migration boundaries leave the routine LSM compaction byte-for-byte lossless. + +### 6.2 Partition-Level Tier Migration Review Comment: Section numbering jumps from §4.2 to §6.2; if §6 is meant to be the next major section after §4, consider renumbering these subsections to keep the sequence consistent. ########## api/proto/banyandb/pipeline/v1/trace_pipeline.proto: ########## @@ -0,0 +1,187 @@ +// Licensed to Apache Software Foundation (ASF) under one or more contributor +// license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright +// ownership. Apache Software Foundation (ASF) licenses this file to you under +// the Apache License, Version 2.0 (the "License"); you may +// not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto3"; + +package banyandb.pipeline.v1; + +import "banyandb/common/v1/common.proto"; +import "google/protobuf/duration.proto"; +import "validate/validate.proto"; + +option go_package = "github.com/apache/skywalking-banyandb/api/proto/banyandb/pipeline/v1"; +option java_package = "org.apache.skywalking.banyandb.pipeline.v1"; + +// StageEvent enumerates the lifecycle events of a stage at which the filter runs. +enum StageEvent { + STAGE_EVENT_UNSPECIFIED = 0; + // In-line filter during an LSM merge within a stage. + STAGE_EVENT_COMPACTION = 1; + // Scheduled final filter once a segment has settled: its event-time window + // has closed and the watermark has advanced past it by a grace period. + // (BanyanDB has no hard "seal" event; settling is watermark-driven.) + STAGE_EVENT_FINALIZE = 2; + // Pre-migration rewrite when a segment leaves this stage for the next. + STAGE_EVENT_MIGRATION_OUT = 3; +} + +// TracePipelineConfig is the root configuration for a storage-node trace pipeline. +// It reuses existing catalog identifiers (group via metadata, stage names, schema +// names) for targeting instead of declaring a parallel metadata model. +message TracePipelineConfig { + // Identity and revision tracking; metadata.group is the Group this pipeline + // lives in and applies to, consistent with every other schema resource. + common.v1.Metadata metadata = 1; + // Active status of the pipeline. + bool enabled = 2; + // Per-stage retention rules: which lifecycle stages this pipeline acts on, + // with the keep-predicates and lifecycle events for each. Empty means the + // tail-sampling gate at finalization is the only filter; no per-stage drop + // is applied. + repeated StageRule stages = 3; + // Explicit schema names to target within the Group (exact match on Metadata.name). + repeated string schema_names = 4; + // RE2 regular expression matched against schema names. A schema is targeted if it + // is listed in schema_names OR matches this pattern. When both are empty, every + // schema in the Group is targeted. + string schema_name_regex = 5; + // Tail-sampling (gating) rules applied at finalization (§8.3) to decide + // whether a trace is retained at all once its segment has settled. + TailSampling tail_sampling = 6; + // Per-trace maturity window for the in-line merge filter (§8.1). A trace is + // eligible for dropping during an LSM compaction merge only once its latest + // span timestamp is older than `now - merge_grace`; younger traces pass + // through the merge unchanged. Bounds the expected intra-trace span arrival + // spread (typically seconds). Must be strictly positive if set; if unset, the + // engine uses a documented default (currently 30s). + google.protobuf.Duration merge_grace = 7 [(validate.rules).duration = { + gt: {seconds: 0} + }]; + // Per-segment settling window for the scheduled finalization pass (§8.3). A + // segment is treated as settled, and the authoritative final filter runs, + // once the event-time watermark exceeds `segment.End + finalize_grace`. + // Bounds segment-wide late arrival (typically minutes). Must be strictly + // positive if set; if unset, the engine uses a documented default + // (currently 5m). Distinct from `merge_grace`: `merge_grace` is per-trace, + // `finalize_grace` is per-segment. + google.protobuf.Duration finalize_grace = 8 [(validate.rules).duration = { + gt: {seconds: 0} + }]; +} + +// StageRule binds the pipeline to one lifecycle stage of the targeted Group +// and declares per-stage retention predicates. +// +// Semantics (when the rule fires at one of its `apply_at` events): +// - If every predicate field is unset (no `min_duration`, no `keep_errors`, +// and an empty `keep_tag_rules`) the rule has no filtering effect — every +// trace at this stage is retained, and the tail-sampling gate at +// finalization remains the only filter affecting traces here. +// - Otherwise the *set* predicates are combined with OR: a trace is retained +// if any single set predicate matches it; a trace that fails all set +// predicates is dropped from this stage at the firing event. +// +// Within `keep_tag_rules`, the individual `TagMatcher` entries are likewise +// OR-combined (any matcher match satisfies the keep_tag_rules predicate). +message StageRule { + // Stage name from the Group's ResourceOpts.stages (e.g. "hot", "warm", "cold"). + string stage = 1; + // Which of this stage's lifecycle events run the filter. Empty means all of + // them (compaction merges, segment finalization, and outbound migration). + repeated StageEvent apply_at = 2; + // Minimum total trace duration to retain at this stage. A trace whose + // duration is at least min_duration is retained. Strictly positive if set; + // if unset, duration is not a retention factor for this stage. + google.protobuf.Duration min_duration = 3 [(validate.rules).duration = { + gt: {seconds: 0} + }]; + // If true, traces with any error span are retained at this stage regardless + // of duration or tag predicates. + bool keep_errors = 4; + // Sure-keep tag matchers: a trace satisfies this predicate when ANY of the + // matchers matches any of its spans (matchers are OR-combined). Distinct + // from TailSampling.tag_rules (which is for ingest-time gating with a + // sample_rate); TagMatcher has no sample_rate — a match is an absolute keep. + repeated TagMatcher keep_tag_rules = 5; +} + +// TailSampling manages tail-sampling (ingest-gating) decisions at finalization. +message TailSampling { + // Absolute trace duration threshold. Traces whose total latency meets or + // exceeds this value are preserved (a sure-keep at gating). Required and + // strictly positive. + google.protobuf.Duration duration_threshold = 1 [(validate.rules).duration = { + required: true + gt: {seconds: 0} + }]; + // Immediate retention rule when any span contains an error state. + bool keep_all_errors = 2; + // Probabilistic keep rate for healthy, low-latency traces that match no + // sure-keep rule. The decision is a deterministic hash(trace_id) < rate, so + // it is stable across re-evaluation. Must be in [0.0, 1.0]. + double healthy_sample_rate = 3 [(validate.rules).double = { + gte: 0.0 + lte: 1.0 + }]; + // Conditions based on tags for immediate sampling decisions. + repeated TagSamplingRule tag_rules = 4; +} + +// TagSamplingRule keeps traces whose tag value satisfies a matcher, with an +// optional probabilistic sample rate (used by the tail-sampling gate). +message TagSamplingRule { + string tag_key = 1; + // Matching condition against the tag's value. Review Comment: `TagSamplingRule.tag_key` should be non-empty; otherwise a rule can be accepted but never match any real tag. The codebase typically validates identifier strings with `string.min_len = 1`. ########## docs/design/post-trace-pipeline.md: ########## @@ -0,0 +1,493 @@ +# BanyanDB Storage-Node Post-Trace Pipeline & Storage Tier Mapping Specification + +This document presents the complete technical design for the **BanyanDB Storage-Node Post-Trace Pipeline** and its integration with physical hardware storage tiers and time-aging schedulers. + +Unlike traditional streaming-based telemetry engines that perform span assembly inside active, memory-heavy windowing pipelines, BanyanDB relies on a **native trace model**. Spans are stored, sorted, and indexed by `trace_id` directly within the storage engine layout on individual data nodes. + +By decoupling trace assembly from the active ingestion path, the post-trace pipeline executes asynchronous tail-sampling, per-stage retention filtering, and data reduction natively during storage lifecycle events on the data nodes. This design is grounded in recent academic advancements in post-hoc retroactive tracing and storage-level file merge analysis. + +## Architectural Concept: Decoupled Gating and Per-Stage Retention + +To maximize storage efficiency and compute performance on the BanyanDB data nodes, trace evaluation is separated into two logical phases, which subsequently feed into an automated **time-aging system** for partition-level migration: + +```Plain Text ++-----------------------------------+ + | Grouped Trace Assembly | + +-----------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 1. Tail-Sampling Gating (tail_sampling, at finalization) | +| - Goal: Decide whether a trace is retained at all | +| - Criteria: sure-keep rules (duration / errors / tags) + sampling | ++-----------------------------------------------------------------------+ + | + +-----------------+-----------------+ + | (Retain) | (Drop / Purge) + v v ++---------------------------------------------------+ +-----------------+ +| 2. Per-Stage Retention (StageRule predicates) | | Discard Block | +| - Goal: At each stage's lifecycle event, | | Reclaim Space | +| keep / drop per the stage's predicates | +-----------------+ +| - Criteria: min_duration / keep_errors / | +| keep_tag_rules, per stage | ++---------------------------------------------------+ + | + v ++-----------------------------------------------------------------------+ +| 3. Time-Aging System (Stage-Stepped Migration Engine) | +| - Goal: Each stage's StageRule governs survival at its lifecycle | +| events (compaction, finalization, migration_out) | +| - Stage Migration: Hot -> Warm -> Cold (LifecycleStage order) | +| - RULE: medium is a node-group concern (may be heterogeneous); | +| predicates govern per-stage retention, not medium selection. | ++-----------------------------------------------------------------------+ +``` + +### 1.1 Gating (`tail_sampling`) + +- **Operation Type**: Tail-sampling filter (keep/drop decision at finalization). It combines absolute *sure-keep* rules with a hashed probabilistic sample of everything else; it is not a pure boolean classifier. + +- **Responsibility**: Initial filtering at the finalization pass (§8.3). It determines whether an assembled trace block **survives** at all — passing into the stage lifecycle — or is **purged** immediately to reclaim storage space on disk. + +- **Compute Profile**: Low CPU overhead. It evaluates absolute keep-rules (duration boundaries, error presence, tag match) plus a single `hash(trace_id) < healthy_sample_rate` comparison. The probabilistic keep is a deterministic `hash(trace_id)`, so it is idempotent across re-evaluation. The sure-keep / drop decision is stable only once the trace has stopped growing: a partial trace's duration is a lower bound, so any drop is deferred until the trace's last span is older than `now − merge_grace` (§8.1). Re-evaluating a matured trace at compaction (§8.1), finalization (§8.3), or outbound migration (§8.2) then yields the same keep/drop result. + +### 1.2 Per-Stage Retention (StageRule predicates) + +- **Operation Type**: Per-stage keep/drop predicate evaluation. Each `StageRule` declares which trace properties (`min_duration`, `keep_errors`, `keep_tag_rules`) constitute a sure-keep at that stage; a trace that fails every predicate is dropped at the stage's next configured lifecycle event. + +- **Responsibility**: Decides **which traces survive each stage** as data ages. The "rising bar" effect — Hot keeps more, Cold keeps less — is expressed by increasingly strict predicates at each stage's `MIGRATION_OUT` (or `COMPACTION`) event. The pipeline does **not** choose the physical storage medium; that is a node-group / `LifecycleStage` placement concern (§4.1). + +- **Compute Profile**: Low CPU overhead. `D_total` comes from block metadata (free, no decode); `has_error` requires a decoded read of span bodies only when `keep_errors` is set; tag matchers run against indexed tags. Predicates are direct boolean checks. + +## Protobuf Message Design + +The pipeline configuration is a single trace-typed message, `TracePipelineConfig`. Rather than a parallel abstract metadata layer, it reuses the existing catalog identifiers — Group (via `metadata`), lifecycle stage names, and schema names — and adds only the trace-specific tail-sampling and per-stage retention rules. General stream parsing and metrics aggregation configurations are excluded to maintain a strict focus on trace-centric analytical operations. + +### 2.1 Targeting Model: Reuse Existing Catalog Identifiers + +This design deliberately does **not** introduce an abstract `Pipeline` resource or an `ExecutionTrigger` enum. Both would re-declare targeting that the storage model already owns and let the two drift. A trace pipeline is instead expressed entirely with identifiers that already exist: + +- **Group** — named by the config's own `metadata.group` (`common.v1.Metadata`). A `TracePipelineConfig` lives in, and applies to, that Group, exactly as every other schema resource does. The Group already fixes the `catalog`, so catalog is never repeated. +- **Lifecycle stages** — a `StageRule` per targeted stage, each naming a stage from the Group's `ResourceOpts.stages` (e.g. `"hot"`, `"warm"`, `"cold"`) and carrying that stage's **retention predicates** (`min_duration`, `keep_errors`, `keep_tag_rules`) and `apply_at` events. Stage names are the same vocabulary queries already accept (`trace/v1/query.proto`'s `stages`). All retention rules are configured here, per stage and per pipeline — there are no hardcoded predicates in the engine. +- **Schema selector** — an explicit `schema_names` list (exact match on `common.v1.Metadata.name`) plus a `schema_name_regex` (RE2). A schema matches if it is listed OR matches the regex; both empty targets every schema in the Group. + +The former `ExecutionTrigger` (COMPACTION / MIGRATION / SCHEDULED) is replaced by **stage lifecycle events**, because each trigger was already anchored to a stage: compaction is an LSM merge within a stage (§8.1), migration is a segment leaving one stage for the next (§8.2), and the scheduled pass finalizes a settled time segment within a stage (§8.3). Each `StageRule` carries its own `apply_at` selecting which of those events run the filter; empty means all of them. This keeps the trace-specific rules out of the generic, catalog-agnostic `common.v1.LifecycleStage` (which stream and measure groups share) while still binding them to stages by name. + +Multiple `TracePipelineConfig`s may coexist in a Group only when their effective coverage is disjoint; the schema registry enforces this with a per-tuple uniqueness rule (§2.3). + +### 2.2 Trace Pipeline Specification (`trace_pipeline.proto`) + +The full message definitions live in the proto source at `api/proto/banyandb/pipeline/v1/trace_pipeline.proto` (package `banyandb.pipeline.v1`); they are not duplicated here. `TracePipelineConfig` is the single root resource: it carries its own identity (`metadata`), the targeting fields from §2.1, and the trace-specific tail-sampling and per-stage retention rules. There is no embedded abstract `Pipeline` and no `ExecutionTrigger`; the lifecycle events that run the filter are named by `apply_at`. + +The message set is: + +- **`TracePipelineConfig`** — root resource: `metadata`, `enabled`, the per-stage `stages` rules, the `schema_names` / `schema_name_regex` selector, the `tail_sampling` (gating) block, and the completeness windows `merge_grace` (§8.1) and `finalize_grace` (§8.3). +- **`StageRule`** — binds the pipeline to one lifecycle stage and declares the per-stage retention predicates: `stage` name, `apply_at` `StageEvent` set, `min_duration`, `keep_errors`, and `keep_tag_rules`. Set predicates are OR-combined: a trace survives if any one matches; it is dropped if it fails all set predicates. A rule with no predicates set at all has no filtering effect (see §4.2). +- **`StageEvent`** — enum of the lifecycle events a rule may run at: `STAGE_EVENT_COMPACTION` (§8.1), `STAGE_EVENT_FINALIZE` (§8.3), and `STAGE_EVENT_MIGRATION_OUT` (§8.2). +- **`TailSampling`** / **`TagSamplingRule`** — gating rules at finalization: `duration_threshold`, `keep_all_errors`, `healthy_sample_rate`, and `tag_rules` (each an `equals` / `in` / `regex` / `exists` matcher with a `sample_rate`). +- **`TagMatcher`** — sure-keep tag predicate used by `StageRule.keep_tag_rules`. Same matcher oneof as `TagSamplingRule` but carries no `sample_rate` — a match is an absolute keep. + +### 2.3 Uniqueness and Conflict Policy + +Multiple `TracePipelineConfig` resources can coexist in a Group, but **at most one active pipeline may target any given `(Group, Schema, Stage, StageEvent)` tuple**. The schema registry enforces this on write so the filter behavior at any tuple is deterministic — there is no implicit ordering, priority, or composition of overlapping pipelines. + +**Effective coverage of a pipeline.** A `TracePipelineConfig` P with `enabled=true` covers a tuple `(P.metadata.group, schema, stage, event)` when: + +- `schema` is a schema in `P.metadata.group` whose name is listed in `P.schema_names` or matches `P.schema_name_regex` (both empty selectors target every schema in the Group); +- `stage` is named by some entry in `P.stages` (empty `P.stages` covers every stage in the Group's `ResourceOpts`); +- `event` is listed in that stage rule's `apply_at` (empty `apply_at` covers every `StageEvent`). + +**Conflict detection on write.** When a `TracePipelineConfig` is created or updated, the registry expands its effective coverage against the current schemas and stages of its Group and rejects the write if **any** covered tuple is already covered by another `enabled` pipeline in the same Group. The error names the conflicting pipeline and the overlapping tuple(s). + +**Schema additions.** When a new `Trace` schema is created in a Group, the registry re-validates every `enabled` `TracePipelineConfig` in that Group against the new schema. If the new schema would cause two pipelines to overlap on any `(stage, event)`, the schema creation is rejected; the operator must narrow the pipelines (e.g. add explicit `schema_names`) before adding the schema. + +**Disabling and replacement.** Setting `enabled=false` removes a pipeline from active coverage immediately, freeing its tuples for another pipeline to claim. This is the safe path for atomically swapping in a new pipeline: disable the old, enable or create the new, then delete the old. There is no implicit precedence between two `enabled` pipelines — the registry will not pick a winner. + +**Runtime defense-in-depth.** At compaction (§8.1), finalization (§8.3), and migration (§8.2) the engine selects the active pipeline by `(group, schema, stage, event)`. If a registry inconsistency ever exposes more than one active match for a tuple, the engine logs an error and applies **none** of them — failing safe to retain rather than risk a nondeterministic destructive drop. + +**Recommended pattern.** For most workloads, a single pipeline per Group with a broad `schema_name_regex` is the simplest configuration. Multiple pipelines in a Group are useful only when their schema selectors are mutually exclusive (for example `schema_names: ["segment"]` and `schema_names: ["zipkin_span"]` in a hypothetical mixed group) so no tuple is doubly covered. + +### 2.4 Admission Validation + +The proto pins bounds with PGV so a malformed config is rejected at parse time rather than producing undefined behavior. The schema registry's admission control adds one cross-field rule that PGV cannot express. + +**PGV-enforced bounds** (rejected at proto parse / `Write` time): + +- `TailSampling.duration_threshold` — **required**, strictly positive (`gt: 0s`). +- `TailSampling.healthy_sample_rate`, `TagSamplingRule.sample_rate` — in `[0.0, 1.0]`. +- `StageRule.min_duration` — strictly positive when set (`gt: 0s`); unset means duration is not a retention factor for that stage. +- `TracePipelineConfig.merge_grace`, `TracePipelineConfig.finalize_grace` — strictly positive when set; unset means engine default (§8.1 / §8.3). + +**Admission rule (server-side; not expressible in PGV):** + +- **Required rule blocks when enabled.** When `TracePipelineConfig.enabled = true`, the config must declare at least one of: a non-empty `tail_sampling`, or at least one `StageRule` with at least one retention predicate (`min_duration`, `keep_errors=true`, or a non-empty `keep_tag_rules`). An enabled pipeline with no gating and no per-stage predicates has no effect on traces and is rejected as a configuration error. To temporarily disable a pipeline, set `enabled = false` — do not strip its rule blocks. + +Both the registry write path and the runtime apply these checks. If a registry inconsistency ever lets a malformed config reach the engine, the engine logs an error and falls back to **retain** for every tuple the malformed config would have governed — never a destructive drop (consistent with the §2.3 runtime fail-safe). + +## Retention-Decision Timing & Trace Completeness + +Both the tail-sampling gate (§1.1) and the per-stage retention predicates (§1.2) operate on the **whole** trace, not on individual spans as they arrive: `D_total` needs the earliest start and latest end, `has_error` scans all spans, and tag matchers scan all spans. Spans of one trace, however, arrive at the storage node anywhere from milliseconds to hours apart, and BanyanDB writes each span into the segment selected by its **event time** (the `start_time` tag), not its arrival time (`banyand/trace/write_standalone.go`). There is no live partial-trace buffer; a trace exists only as spans co-located by `trace_id` inside a segment's parts. + +### 3.1 When Retention Decisions Are Safe + +Predicates are not evaluated for destructive drops at write time. They run lazily, by the post-trace passes that re-assemble a trace by `trace_id`: + +- **Provisional** — during an LSM compaction merge (§8.1). This sees only the spans that have landed so far, so a drop verdict could be premature; the per-trace `merge_grace` gate (§8.1) defers any destructive drop until the trace has stopped growing. +- **Authoritative** — at the post-trace scheduler's finalization pass once the trace's segment has **settled**: its event-time window has closed *and* the event-time watermark has advanced past it by the `finalize_grace` period (§8.3). BanyanDB has no hard "segment sealed" event — `create()` pre-creates the next segment up to an hour *before* the current window ends, and late data keeps routing to an old segment by event time — so "settled" is a watermark heuristic, not a guarantee. +- **At migration_out** — by the time a segment is migrated to the next stage (≥ its stage's TTL, orders of magnitude longer than `finalize_grace`), every trace it holds has been settled, so predicate evaluation is final. + +Two completeness caveats follow from event-time segmentation: + +1. **Spans arriving after finalization.** A span whose `start_time` falls in an already-finalized segment is still accepted (while the segment is within retention) but is missed by the finalization-time evaluation. The `finalize_grace` period bounds how often this happens; a late span that lands after it is only re-evaluated if a subsequent compaction rewrites that part. + +2. **Traces spanning multiple segments.** A trace longer than `segment_interval`, or one that straddles a boundary, is physically split across two segments, and each segment's pass evaluates only its own fragment (`D_total` and the indicators are fragment-local). BanyanDB performs no cross-segment trace assembly at the storage layer. Where trace duration is far below `segment_interval` (e.g. the showcase's ≈2.8 s max trace against 1-day segments) the fragment equals the whole trace and this is a non-issue; for long-running traces it is a known limitation. + +## Integration of the Retention System & Time-Aging System + +Distributed telemetry's diagnostic utility naturally decays as time passes. The pipeline must therefore decide, per stage, what to keep — but it must **not** try to pick the physical storage medium, because in BanyanDB the medium is already determined by where a `LifecycleStage` is placed. + +### 4.1 Stage Placement vs. Rule-Governed Retention + +The physical medium is **not** fixed per tier by this design. Each `LifecycleStage` (`common.v1.LifecycleStage`) is routed to a node group through its `node_selector`; the medium is whatever hardware that node group runs, and operators may deploy heterogeneous backings (for example two warm node groups on different SSD classes). A common placement is Hot → local NVMe, Warm → SATA SSD/HDD, Cold → object storage, but that mapping is an operator deployment choice, not a rule this pipeline enforces. + +Because medium selection already belongs to stage placement, per-stage `StageRule` predicates **do not route traces to a medium**. Within whatever stage currently holds a trace, the predicates only act as a **retention gate** — dictating whether a trace is retained when a partition is rewritten or migrated, or discarded/pruned to reduce storage footprint. + +### 4.2 Stage-Stepped Retention via Rising Predicates + +"Aging" is expressed structurally: each `StageRule` declares its own predicates, and the bar rises as data moves to colder stages. **Semantics of a `StageRule` at one of its `apply_at` events:** the set predicates are OR-combined — a trace is retained if **any** set predicate matches it, and dropped if it fails all of them. A `StageRule` with no predicates configured at all has no filtering effect (every trace at that stage is retained); to actually drop traces at a stage, at least one of `min_duration`, `keep_errors`, or `keep_tag_rules` must be set. + +A typical profile rises Hot → Warm → Cold by tightening predicates at each stage's `MIGRATION_OUT` event. For example: + +- **Hot → Warm:** `min_duration: 100ms` OR `keep_errors` OR matches a key tag (PostgreSQL, ActiveMQ, etc.). +- **Warm → Cold:** `min_duration: 500ms` OR `keep_errors` OR matches the most-important tag. +- **In Cold (on compaction):** `keep_errors` only — slow-but-healthy traces are dropped, only errors persist for the full 30-day retention. + +Predicates are direct boolean checks, so retention is deterministic and self-explanatory: "this trace was retained at Warm because `db.type=PostgreSQL` matched, even though it was only 3 ms." Time enters only through *when* a partition reaches each stage, which is governed by `LifecycleStage.ttl` / `segment_interval`. + +## Downstream Lifecycle Actions Governed by Per-Stage Predicates + +The time-aging engine evaluates each trace against its current stage's `StageRule` predicates at the stage's configured lifecycle events, filtering data blocks as partitions are rewritten or migrated between lifecycle stages. + +### 6.1 Compaction Rewrites + +During normal LSM merge operations inside a stage, the data node compacts older parts. When a `StageRule` includes `STAGE_EVENT_COMPACTION` in its `apply_at`, the merge filter (§8.1) evaluates each mature trace against the stage's predicates and omits non-matching traces from the consolidated output. Compaction is then both a space-reclamation pass and a per-stage retention pass; pipelines that target only migration boundaries leave the routine LSM compaction byte-for-byte lossless. + +### 6.2 Partition-Level Tier Migration + +Both showcase trace groups (`sw_trace`, `sw_zipkinTrace`) declare a Hot → Warm → Cold lifecycle in `ResourceOpts.stages`: Hot holds the active window (1-day ttl), Warm runs on `node_selector type=warm` (7-day ttl), and Cold runs on `node_selector type=cold` with `close=true` (30-day ttl). When a Hot segment matures past its 1-day ttl, the migration worker copies it to the Warm node group; the medium behind each node group (e.g. NVMe → SATA SSD → object storage) is the operator's deployment choice (§4.1), not something this pipeline picks. + +- The migration engine evaluates the source stage's `MIGRATION_OUT` predicates against every trace in the partition. + +- **No dynamic splitting is performed:** all retained data is written to the next stage's node group. Traces that fail the source stage's `MIGRATION_OUT` predicates are omitted from the target write stream, reducing the physical size of the migrated partition. With Scenario 7.1's Hot predicates a healthy `/homepage` trace (2802 ms) matches `min_duration: 100ms` and migrates to Warm; a PostgreSQL-touching trace matches `keep_tag_rules` and migrates too; a healthy fast trace (6 ms) matches none and is dropped. + +- When the partition matures past the Warm 7-day ttl, the Warm stage's `MIGRATION_OUT` predicates apply (typically stricter — e.g. `min_duration: 500ms`); only matching traces are written into the Cold parts. Healthy baseline traces, even ones that survived Warm, are dropped here; error traces (with `keep_errors: true`) persist all the way to Cold. + +### 6.3 Eviction: Per-Trace Drop During Rewrites, Segment-Granularity GC + +BanyanDB does not tombstone individual traces, and this design does not add per-trace tombstones. The trace part layout is column-oriented (separate `primary`, `spans`, and `tags` streams), so there is no per-trace delete primitive. Eviction therefore happens at two distinct granularities: + +- **Per-trace removal is a side effect of the filter rewrites, not GC.** A trace that fails its current stage's retention predicates is simply omitted the next time its part is rewritten — during a compaction merge (§8.1), a segment-finalization pass (§8.3), or a pre-migration rewrite (§8.2). No tombstone is written; the trace's columns are not copied into the new part, and the space is reclaimed when the old part is retired. + +- **Whole-segment eviction stays the existing mechanism.** Reclaiming an entire time segment remains governed by the current retention path — TTL expiry plus the disk high/low watermarks (`banyand/trace/svc_standalone.go`, `TopicDeleteExpiredTraceSegments`). Per-stage predicates do not place tombstones and do not trigger segment deletion; they only change how much of a segment survives each rewrite. + +## Operational Scenario Configurations + +Below are two operational scenarios represented as complete `TracePipelineConfig` instances, one per trace group found in the skywalking-showcase cluster (`sw_trace` and `sw_zipkinTrace`). Group names, schema names, tags, latencies, and lifecycle stages are taken from real data in that cluster; the retention outcomes quoted are derived by evaluating the per-stage predicates against real sampled traces. + +### 7.1 Scenario 1: SkyWalking-Native Segment Retention (`sw_trace`) + +- **Objective**: On the showcase `sw_trace` group (schema `segment`), keep every error trace all the way to Cold for incident forensics, keep genuinely slow requests through Warm, always keep traces that touch PostgreSQL or the ActiveMQ `queue-songs-ping` queue at the Hot→Warm boundary, and probabilistically sample the healthy remainder at gating. The group's real Hot → Warm → Cold stages (ttl 1d / 7d / 30d) carry rising predicate strictness at each `MIGRATION_OUT` event. + +- **Configuration JSON**: + +```Plain Text +{ + "metadata": { "group": "sw_trace", "name": "segment-tail-sampler" }, + "enabled": true, + "stages": [ + { + "stage": "hot", + "apply_at": ["STAGE_EVENT_MIGRATION_OUT"], + "min_duration": "0.100s", + "keep_errors": true, + "keep_tag_rules": [ + { "tag_key": "db.type", "equals": "PostgreSQL" }, + { "tag_key": "mq.queue", "equals": "queue-songs-ping" } + ] + }, + { + "stage": "warm", + "apply_at": ["STAGE_EVENT_MIGRATION_OUT"], + "min_duration": "0.500s", + "keep_errors": true, + "keep_tag_rules": [ + { "tag_key": "db.type", "equals": "PostgreSQL" } + ] + }, + { + "stage": "cold", + "apply_at": ["STAGE_EVENT_COMPACTION"], + "keep_errors": true + } + ], + "schema_names": ["segment"], + "tail_sampling": { + "duration_threshold": "0.500s", + "keep_all_errors": true, + "healthy_sample_rate": 0.1, + "tag_rules": [ + { "tag_key": "db.type", "equals": "PostgreSQL", "sample_rate": 1.0 }, + { "tag_key": "mq.queue", "equals": "queue-songs-ping", "sample_rate": 1.0 } + ] + }, + "merge_grace": "30s", + "finalize_grace": "300s" +} +``` + +- **Retention Dynamics** (real `sw_trace` traces, predicates evaluated at each stage's lifecycle event): + + - The error trace `5fcdb353-…` (`POST /test`, `agent::app`, `is_error=1`, 4 ms) is a sure-keep at gating (`keep_all_errors`). At Hot→Warm migration, `keep_errors` matches → kept. At Warm→Cold migration, `keep_errors` matches → kept. At Cold compaction, `keep_errors` matches → retained for the full 30-day Cold TTL. + + - The slow healthy trace `b03bb932-…` (`/homepage`, `agent::ui` → `agent::frontend`, 2802 ms) is a sure-keep at gating (2802 ms > 500 ms `duration_threshold`). At Hot→Warm: 2802 ms ≥ 100 ms `min_duration` → kept. At Warm→Cold: 2802 ms ≥ 500 ms → kept. At Cold compaction: no error, `keep_errors` fails → **dropped from Cold**. + + - A PostgreSQL-touching trace (e.g. `b31e4be8-…`, `agent::songs` `UndertowDispatch`, 3 ms, `db.type=PostgreSQL`) is sure-kept by the `db.type` tag rule at gating. At Hot→Warm: `keep_tag_rules` matches → kept. At Warm→Cold: `keep_tag_rules` (still includes `db.type`) matches → kept. At Cold compaction: no error → **dropped from Cold**. + + - A healthy fast trace such as `GET:/songs` at 6 ms (`agent::songs`, `http.status_code=200`) is only kept at gating if it wins the `healthy_sample_rate` (`0.1`) hash. If kept, at Hot→Warm: 6 ms < 100 ms, no error, no tag match → **dropped at Hot→Warm migration**. + +### 7.2 Scenario 2: Istio / Zipkin Mesh Edge Sampling (`sw_zipkinTrace`) + +- **Objective**: On the showcase `sw_zipkinTrace` group (schema `zipkin_span`), apply a lower-cost edge sampler to the Istio service-mesh spans. The Zipkin schema has no first-class `is_error` column, so server errors are caught with a tag rule on the flattened `query` attributes rather than `keep_errors`; mesh gateway spans are kept by tag. Targets the group's Warm and Cold stages. + +- **Configuration JSON**: + +```Plain Text +{ + "metadata": { "group": "sw_zipkinTrace", "name": "zipkin-edge-sampler" }, + "enabled": true, + "stages": [ + { + "stage": "warm", + "apply_at": ["STAGE_EVENT_MIGRATION_OUT"], + "min_duration": "1s", + "keep_tag_rules": [ + { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d" }, + { "tag_key": "local_endpoint_service_name", "equals": "gateway.sample-services" } + ] + }, + { + "stage": "cold", + "apply_at": ["STAGE_EVENT_COMPACTION"], + "keep_tag_rules": [ + { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d" } + ] + } + ], + "schema_names": ["zipkin_span"], + "tail_sampling": { + "duration_threshold": "1.000s", + "keep_all_errors": false, + "healthy_sample_rate": 0.05, + "tag_rules": [ + { "tag_key": "query", "regex": "http\\.status_code=5\\d\\d", "sample_rate": 1.0 } + ] + }, + "merge_grace": "30s", + "finalize_grace": "300s" +} +``` + +- **Retention Dynamics** (real `sw_zipkinTrace` spans): + + - The slowest mesh call observed — `trace_id 0961e077…`, a 30.7 s `istio.skywalking-showcase` client span to Grafana's live-WS endpoint (`http.status_code=101`) — is a sure-keep at gating (30.7 s > 1 s). At Warm→Cold migration: 30.7 s ≥ 1 s `min_duration` → kept. At Cold compaction: `query` carries `http.status_code=101` (not 5xx), no other predicate → **dropped from Cold**. + + - A gateway span on `gateway.sample-services` at the mesh p90 (~19 ms): would be passed by gating only via the `0.05` `healthy_sample_rate` hash (19 ms < 1 s, no 5xx). If kept, at Warm→Cold: 19 ms < 1 s, but `local_endpoint_service_name = gateway.sample-services` matches → kept. At Cold compaction: no 5xx → **dropped from Cold**. + + - A typical p50 mesh span (~2 ms) is kept at gating only via the `0.05` sample. At Warm→Cold: 2 ms < 1 s, not a gateway span, no 5xx → **dropped at Warm→Cold**. + + - Any span carrying a `5xx` in its `query` attributes is sure-kept at gating, kept at Warm→Cold via the `query` tag matcher, and kept at Cold compaction via the same matcher — preserved for the full Cold TTL. Whole Warm/Cold segments are still reclaimed on their own schedule by `LifecycleStage.ttl`. + +## Post-Trace Data Flow Architecture + +The execution of the post-trace pipeline occurs natively on BanyanDB Data Nodes via three distinct pathways, matching storage lifecycle transitions. + +```Plain Text +[ Raw Ingestion ] -> (Liaison Node) -> [ Fast Local Write ] -> (Data Node Memory Buffer) + | ++----------------------------------------------------------------------+ +| +v +[ Storage Node Post-Trace Processing Loops ] +| ++---> 8.1. LSM Compaction Merge Filter Hook (in-line filter during merge; lossless unless targeted) +| ++---> 8.2. Pre-Migration Filter Rewrite (rewrite segment to a reduced part, then migrate it unchanged) +| ++---> 8.3. Scheduled Final Filter on a Settled Segment (watermark past window + grace; best-effort final) +``` + +### 8.1 LSM Compaction Merge Phase Loop (In-line Merge Filter Hook) + +During LSM file compaction, immature part files containing segmented trace blocks are merged into unified, consolidated parts. The merge loop already streams blocks grouped by `trace_id`, so it is the ideal place to evaluate traces without a separate read pass. + +> **Design decision (accepted):** The trace LSM merger is refactored to expose a **trace-filter hook**. The pipeline plugs a filter into the existing merge stream so that, when a `TracePipelineConfig` with a `StageRule` whose `apply_at` includes `STAGE_EVENT_COMPACTION` (or is empty) targets the schema being compacted, blocks belonging to dropped traces are omitted from the consolidated output. Without a targeting pipeline the hook is a no-op and the merge stays byte-for-byte lossless, preserving the current LSM correctness guarantee. + +#### Integration point in the real merger + +The hook is injected into `mergeBlocks` (`banyand/trace/merger.go`), which is the single point where every emitted block flows to the block writer. Both write paths are wrapped by the filter: + +- `blockWriter.mustWriteRawBlock` — the fast path that copies a single-block trace as raw bytes without decoding. +- `blockWriter.mustWriteBlock` — the slow path that emits a decoded, accumulated block for a `trace_id`. + +```Plain Text +[ Immature Parts ] --> [ blockReader.nextBlockMetadata ] (streams blocks ordered by trace_id) + | + v +[ Per-trace assembly (existing pending-block accumulation in mergeBlocks) ] + | + v +[ Mature? traceMaxTs (timestampsMetadata.max) < now - merge_grace ] + | | + | NO (trace may still grow) | YES (stopped growing) + v v +[ RETAIN unchanged ] [ TraceFilter hook ] --- keyed on trace_id, cached + | - evaluate the stage's StageRule predicates + | (min_duration / keep_errors / keep_tag_rules) + | + +--> RETAIN --> mustWriteBlock / mustWriteRawBlock + | + +--> DROP --> blocks skipped; spans, tags, primary + entries never written (space reclaimed + in-stream, no delete mutation) +``` + +#### Filter contract and what the merge already gives us for free + +1. **Duration is free; errors and tags cost a decode.** Block metadata carries `timestampsMetadata{min,max}` (`block_metadata.go`), so `D_total` is derivable on the raw fast path without unmarshaling — `min_duration` checks are free. `keep_errors` and `keep_tag_rules` predicates inspect span bodies, so any pipeline using them forces the decoded slow path (`loadBlockData`) for the targeted schema. Duration-only stage rules can keep the raw fast path. + +2. **Decisions are per `trace_id`, not per block.** A single trace may be emitted as multiple blocks when its accumulated span size crosses `maxUncompressedSpanSize`. The filter computes one retain/drop verdict per `trace_id` and applies it to **every** block carrying that id, so a trace is never partially written. + +3. **A trace is dropped only after it stops growing (per-trace grace).** Dropping blocks makes a compaction intentionally lossy, and compaction is triggered by part accumulation (`getPartsToMerge`), not by time — so a merge routinely runs on the active write window while a trace's remaining spans are still seconds away. Dropping such a trace on its partial spans would orphan the late ones. The filter therefore evaluates a trace only once its **latest span timestamp is older than `now − merge_grace`** (the trace's max timestamp is read for free from `timestampsMetadata.max`, point 1); traces newer than that frontier are passed through the merge unchanged. The filter is additionally gated on targeting — consulted only when the snapshot's group/schema matches a pipeline with a `StageRule` whose `apply_at` includes `STAGE_EVENT_COMPACTION`; otherwise the default compaction path is unchanged. This `merge_grace` is a **per-trace** maturity window, distinct from the **per-segment** ` finalize_grace` of §8.3: the former bounds intra-trace span spread, the latter bounds segment-wide late arrival. Both are fields of `TracePipelineConfig` (`merge_grace`, `finalize_grace`); the engine applies documented defaults (`30s` and `5m` respectively) when a field is unset. + +4. **Derived part state must reflect only retained traces.** When a trace is dropped, its entries are excluded from `partMetadata` counts, the `traceIDFilter` bloom filter (`mustWriteTraceIDFilter`), and the `tagType` set written at `Flush`. The filter runs before these are finalized so the consolidated part stays self-consistent. + +5. **Secondary indexes are pruned in lockstep.** `mergePartsThenSendIntroduction` merges sidx parts separately from the trace blocks (`banyand/trace/merger.go`). The set of dropped `trace_id`s is propagated to the sidx merge so dropped traces are not left dangling in secondary indexes. + +6. **Drops are final and the merge is crash-safe.** Because a trace is dropped only after `merge_grace` has elapsed since its last span (point 3), the verdict acts on a trace that has stopped growing, so the drop is final rather than premature. The merge writes the new part atomically and only retires source parts after the introduction is applied, so a crash mid-merge leaves the immature parts intact for retry; re-running compaction on an already-filtered part is a no-op. + +### 8.2 Hot-to-Warm Tier Migration (Pre-Migration Filter Rewrite) + +When the lifecycle agent migrates a segment to the next stage, the existing transfer is an opaque byte stream of part directories — it is not a per-trace filter. Today's path walks shards via `traceMigrationVisitor.VisitShard` (`banyand/backup/lifecycle/trace_migration_visitor.go`), reads source parts with `generateAllPartData` (which opens part files via `trace.CreatePartFileReaderFromPath`, `banyand/trace/part.go`), and ships them with `streamPartToTargetShard` over a chunked sync client to the destination replicas. + +> **Design decision (accepted):** Do **not** filter inside the byte-copy transfer. Instead, run a **pre-migration filter pass** on the source segment that reads the existing parts through the pipeline and writes a **new, reduced part**; the unchanged migration then streams that new part to the next stage. This keeps "migration" a faithful byte copy and isolates all lossy behaviour in a dedicated rewrite step. + +```Plain Text +[ Hot Tier segment (settled; >= stage TTL old) ] + | + v +[ Pre-migration rewrite pass ] (new visitor over the segment's parts) + | read each part: blockReader over CreatePartFileReaderFromPath + | apply pipeline filter (same trace-filter contract as §8.1) + | write retained traces to a NEW reduced part via blockWriter + v +[ Reduced part on Hot tier ] + | + v +[ Existing migration (unchanged) ] VisitShard -> generateAllPartData -> streamPartToTargetShard + | opaque chunked byte copy of the reduced part + v +[ Warm tier segment ] +``` + +1. The pre-migration pass attaches between `generateAllPartData` and `streamPartToTargetShard` in `VisitShard`: each source part is read with the same `blockReader`/`blockWriter` machinery the merger uses (§8.1), the pipeline filter is applied, and a reduced part is produced in place of the source. + +2. The filter is the same `STAGE_EVENT_MIGRATION_OUT`-gated trace-filter contract as §8.1 — per-`trace_id` retain/drop verdicts, with dropped traces excluded from `partMetadata`, the `traceIDFilter` bloom filter, `tagType`, and the parallel sidx parts. Because a segment only migrates once it is at least its stage TTL old (e.g. a Hot segment after ~1 day), its event-time window closed long ago and any late arrivals have settled, so every trace is effectively complete and the verdict is final. + +3. The reduced part — not the original — is what `streamPartToTargetShard` transfers. The migration protocol, replica fan-out, and destination registration are untouched, so the considerable savings come purely from shipping fewer bytes to the next tier. + +### 8.3 Scheduled Final Filter on a Settled Segment + +Filtering during an LSM merge (§8.1) sees only the spans present in the parts being merged; a trace whose remaining spans are still arriving could be judged prematurely. The **final** tail-sampling decision must therefore run only once a segment is unlikely to gain more spans. The catch: **BanyanDB has no "segment sealed" signal** to trigger this, so the pass is driven by an event-time watermark and a grace period instead. + +Why there is no seal event (grounding in `banyand/internal/storage/rotation.go`): + +- **Writes route by event time, so old segments keep receiving data.** A span is written into the segment whose window contains its `start_time` (`banyand/trace/write_standalone.go` → `CreateSegmentIfNotExist(time.Unix(0, ts))`), not its arrival time. A span that arrives hours late still lands in its old event-time segment for as long as that segment exists. +- **Segment creation is look-ahead, not rollover.** The rotation loop pre-creates the *next* segment up to `creationGap` (1 hour) **before** the current window ends: it fires `segmentController.create` only while `0 < latest.End - eventTime <= newSegmentTimeGap`. At that moment the current segment is still the active write target, so `create()` cannot mean "the previous segment is sealed." +- **`closeIdleSegments` is not a seal.** It only releases idle in-memory handles after an idle timeout (`banyand/internal/storage/segment.go`) and reopens them on the next access; it says nothing about window completeness. +- **The only definitive boundary is TTL removal** (`retentionTask`, cron `5 0`), which deletes the whole segment — far too late to be a finalization trigger. + +> **Design decision (accepted):** Drive the final filter from an **event-time watermark plus a settling grace period**, not from a (non-existent) seal event. The rotation path already tracks the maximum observed event time (`latestTickTime`, fed by `Tick`, `rotation.go`). A segment is treated as **settled** once `watermark > segment.End + finalize_grace`, where `finalize_grace` is `TracePipelineConfig.finalize_grace` — a configured expected-late-arrival lag (engine default `5m` if unset), distinct from the per-trace `merge_grace` of §8.1. A new post-trace scheduler — registered on `pkg/timestamp.Scheduler` (cron-backed, like `retentionTask`) — periodically scans for settled-but-unfinalized segments and runs the final filter once per segment. This is the proto's `STAGE_EVENT_FINALIZE`. It is **best-effort final**, not a hard guarantee: data arriving after `finalize_grace` is missed (§3.1, caveat 1). The finalization pass is also the point where the tail-sampling gate (`tail_s ampling`) is applied — traces failing both the gate and any `STAGE_EVENT_FINALIZE` `StageRule` predicates are dropped here. Review Comment: The finalization description says traces are dropped only if they fail both the tail-sampling gate and any `STAGE_EVENT_FINALIZE` `StageRule` predicates, which implies StageRule can override a tail-sampling drop. Earlier sections describe tail-sampling as deciding whether a trace is retained at all; please clarify the exact composition/order at FINALIZE (e.g., is tail_sampling an absolute gate, or is retention the union of tail_sampling and StageRule sure-keeps?). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
