This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch phase-2-cp5-march
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git

commit 7ca9ab0d076e9cb70ee27774c623f418383a0f49
Author: Hongtao Gao <[email protected]>
AuthorDate: Mon May 4 11:35:28 2026 +0000

    test(schema): re-enable §4.6.2 in distributed; defer §6.8 / §6.11 / §4.6.4 to Step 2.5

    Phase 2 §RE-1 — first attempt at unskipping the four Phase-1-deferred
    distributed schema specs against the new cluster barrier:

    - §4.6.2 (clamp.go: "succeeds and returns zero elements when query
      spans schema CreatedAt"): NOW PASSES in distributed mode. The
      clamp + Query flow has no Write→Query baseline; the cluster barrier
      alone is enough to make this spec deterministic.
    - §6.8 (shape_break.go: "delete+apply new shape creates the new
      measure"), §6.11 (shape_break.go: "delete-then-recreate original
      shape drops old data"), §4.6.4 (clamp.go: "clips TimeRange.Begin to
      max(CreatedAt) and excludes pre-creation data") all STILL FLAKE.
      Each one's baseline step is a Create → AwaitRevision → Write →
      Query round-trip expecting the written datum back; the post-barrier
      query races the data-node write path independently of schema
      propagation. Empirically reproduced in a full distributed
      integration run: the schema barrier converges, then the very next
      Query returns 0 data points instead of 1.

    This matches the plan's §RE-1 forecast — these three specs are
    blocked on Step 2.5's cluster-wide query gate, not on the barrier
    itself.

    Per the plan: "for each that still flakes, leave the guard in place
    and add a one-sentence comment pointing to Step 2.5's cache-layer
    prerequisite." Updated each guard's comment to point at Step 2.5
    rather than "Phase 2.1–2.2 once those land."

    CHANGES.md: extend the Phase 2 sub-bullet under 0.11.0 to cover
    §SS-1..4 (mid-call eviction / leave / late-join + NodeLaggard.reason
    proto field), §FA-1/FA-2 (AwaitSchemaApplied cluster fan-out),
    §FD-1/FD-2 (AwaitSchemaDeleted cluster fan-out), and the §4.6.2
    re-enable.
    via [HAPI](https://hapi.run)
---
 CHANGES.md                       |  7 +++++--
 test/cases/schema/clamp.go       | 24 +++++-------------------
 test/cases/schema/shape_break.go | 23 +++++++++--------------
 3 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 3195f1d79..b67f064bf 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -31,10 +31,13 @@ Release Notes.
 - Add tombstone retention/GC (default 7 days, configurable via `--schema-server-tombstone-retention`) with a per-cache count cap to bound memory under bulk deletes.
 - Reject `Create` with `updated_at <= tombstone.delete_time` to prevent replayed creates from overwriting newer deletes.
 - Guard `pkg/schema/cache` against out-of-order `EventDelete` events; expose monotonic `LatestModRevision` watermark.
-- Schema consistency (Phase 2 in progress): cluster-wide barrier groundwork. Internal-only; no client-facing surface impact yet.
-  - Add `NodeSchemaStatusService` (`GetMaxRevision`, `GetKeyRevisions`, `GetAbsentKeys`) registered on every cluster member that holds a schema cache, so peer liaisons and data nodes can be probed identically by the upcoming barrier fan-out (#1108).
+- Schema consistency (Phase 2 in progress): cluster-wide barrier. Internal-only; no client-facing surface impact yet.
+  - Add `NodeSchemaStatusService` (`GetMaxRevision`, `GetKeyRevisions`, `GetAbsentKeys`) registered on every cluster member that holds a schema cache, so peer liaisons and data nodes can be probed identically by the barrier fan-out (#1108).
   - Extend `queue.Client` with `NewNodeSchemaStatusClient(node)` so the barrier fan-out can borrow the existing tier1/tier2 connection pools instead of opening a parallel mesh (#1109).
   - `AwaitRevisionApplied` now fans out across the receiving liaison's frozen tier1 (peer-liaison) + tier2 (data-node) Active set, probing each member in parallel via `GetMaxRevision` with shared per-call deadline. Cross-version peers returning `codes.Unimplemented` are treated as ready so partial-upgrade clusters do not deadlock; transient RPC errors count as per-iteration laggards. Empty Active set fails fast with `codes.Unavailable`.
+  - Frozen-snapshot mid-call semantics: members that transition `Active → Evictable` during a call are dropped from subsequent probes and surfaced once as a `NodeLaggard{reason="evicted_during_poll"}`; members that disappear from the route table altogether are dropped silently; late joiners are excluded from the watched set until the next call. Adds `reason` field (5) to `NodeLaggard` proto.
+  - `AwaitSchemaApplied` and `AwaitSchemaDeleted` follow the same fan-out shape using `GetKeyRevisions` / `GetAbsentKeys` respectively, with per-node calls chunked at 1000 keys and a shared call-wide deadline (no equal-slice division across chunks). Per-node laggards carry the per-member `missing_keys` / `still_present_keys` they observed.
+  - Re-enable §4.6.2 distributed spec (no Write→Query baseline). §6.8, §6.11, §4.6.4 remain skipped pending Step 2.5's cluster query gate.
 
 ### Bug Fixes
 
diff --git a/test/cases/schema/clamp.go b/test/cases/schema/clamp.go
index 90bf30940..db21b8717 100644
--- a/test/cases/schema/clamp.go
+++ b/test/cases/schema/clamp.go
@@ -116,17 +116,6 @@ var _ = g.Describe("Schema time-range clamp", func() {
 	// server clamps Begin forward to CreatedAt and the query executes successfully.
 	// Since no data was written the response has zero elements but no error.
 	g.It("succeeds and returns zero elements when query spans schema CreatedAt (§4.6.2)", func() {
-		// TODO(phase-2): Phase 1 AwaitRevisionApplied / AwaitSchemaApplied are liaison-only
-		// by design. Unlike §4.6.1 and §4.6.3 (both ends far in the past, where the clamp
-		// short-circuits at the liaison and never dispatches to data nodes), this spec uses
-		// End=now+1h so the clamped range is non-empty and the query is dispatched. In
-		// distributed mode the data node can lag the liaison's schema view at that moment,
-		// causing the dispatched query to fail with "group not found". Cluster-wide barrier
-		// semantics ship in Phase 2 via NodeSchemaStatusService + liaison fan-out (plan
-		// Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
-		if SharedContext.Mode == helpers.ModeDistributed {
-			g.Skip("§4.6.2 requires cluster-wide propagation barrier (Phase 2)")
-		}
 		groupName := fmt.Sprintf("clamp-span-%d", time.Now().UnixNano())
 		streamName := "clamp_stream"
 
@@ -226,15 +215,12 @@ var _ = g.Describe("Schema time-range clamp", func() {
 	// inside [Begin, End] and the datum would leak — proving the clamp is actually
 	// applied rather than merely consistent with an already-in-range write.
 	g.It("clips TimeRange.Begin to max(CreatedAt) and excludes pre-creation data (§4.6.4)", func() {
-		// TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design. This spec's
-		// baseline sanity check (Create → AwaitRevision → Write → Query expecting HaveLen(1))
-		// races the data-node tsTable readiness in distributed mode. The actual clamp
-		// falsification is sound; only the prerequisite Write→Query round-trip flakes.
-		// Cluster-wide barrier semantics ship in Phase 2 via NodeSchemaStatusService +
-		// liaison fan-out (plan Steps 2.1–2.2); re-enable this spec in distributed mode
-		// once those land.
+		// Phase 2.2 barrier ensures schema is on every node, but this spec's
+		// baseline sanity step (Create → Write → Query expecting 1 datum)
+		// still races the data-node write path in distributed mode. Re-enable
+		// once Step 2.5 (cluster query gate) lands.
 		if SharedContext.Mode == helpers.ModeDistributed {
-			g.Skip("§4.6.4 requires cluster-wide propagation barrier (Phase 2)")
+			g.Skip("§4.6.4 requires the cluster-wide query gate (Phase 2 Step 2.5)")
 		}
 		group1 := fmt.Sprintf("clamp-leak1-%d", time.Now().UnixNano())
 		group2 := fmt.Sprintf("clamp-leak2-%d", time.Now().UnixNano())
diff --git a/test/cases/schema/shape_break.go b/test/cases/schema/shape_break.go
index 192bad7ab..11693fee0 100644
--- a/test/cases/schema/shape_break.go
+++ b/test/cases/schema/shape_break.go
@@ -160,14 +160,12 @@ var _ = g.Describe("Schema shape-break rejection", func() {
 
 	// §6.8: shape-break — delete+apply new shape creates the new measure (Rule 7 clamp end-to-end).
 	g.It("shape-break: delete+apply new shape creates the new measure (§6.8)", func() {
-		// TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design. Both this spec
-		// and §6.11 perform an end-to-end data Write+Query round-trip through the liaison after
-		// a schema mutation; in distributed mode the data node can lag the liaison briefly on
-		// tsTable readiness or query-side index refresh, racing the immediate Write→Query.
-		// Cluster-wide barrier semantics ship in Phase 2 via NodeSchemaStatusService + liaison
-		// fan-out (plan Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
+		// Phase 2.2 cluster barrier confirms schema propagation across nodes,
+		// but this spec's Write→Query baseline races the data-node write path
+		// independently of the schema barrier. Re-enable in distributed mode
+		// once Step 2.5 (cluster query gate) lands.
 		if SharedContext.Mode == helpers.ModeDistributed {
-			g.Skip("§6.8 requires cluster-wide propagation barrier (Phase 2)")
+			g.Skip("§6.8 requires the cluster-wide query gate (Phase 2 Step 2.5)")
 		}
 		groupName := fmt.Sprintf("sb-new-%d", time.Now().UnixNano())
 		measureName := "throughput"
@@ -410,14 +408,11 @@ var _ = g.Describe("Schema shape-break rejection", func() {
 
 	// §6.11: delete-then-recreate original shape drops old data (Rule 7 clamp).
 	g.It("delete-then-recreate original shape drops old data (§6.11)", func() {
-		// TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design — it confirms
-		// the liaison cache observes R2 but not that every data node has rebuilt the tsTable
-		// after the delete-then-recreate. The post-recreate Write+Query round-trip exercised
-		// by this spec races the data-node tsTable rebuild on slow CI runners. Cluster-wide
-		// barrier semantics ship in Phase 2 via NodeSchemaStatusService + liaison fan-out
-		// (plan Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
+		// Same write→query race as §6.8 — Phase 2.2's barrier ensures schema
+		// coherence but the post-recreate Write→Query baseline still flakes
+		// without the cluster query gate. Re-enable once Step 2.5 lands.
 		if SharedContext.Mode == helpers.ModeDistributed {
-			g.Skip("§6.11 requires cluster-wide propagation barrier (Phase 2)")
+			g.Skip("§6.11 requires the cluster-wide query gate (Phase 2 Step 2.5)")
 		}
 		groupName := fmt.Sprintf("sb-same-%d", time.Now().UnixNano())
 		measureName := "throughput"
