This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch phase-2-cp5-march
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git

commit 47006561f204a04c163e0494d374ba8541dd8e6e
Author: Hongtao Gao <[email protected]>
AuthorDate: Mon May 4 11:35:28 2026 +0000

    test(schema): re-enable §4.6.2 in distributed; defer §6.8 / §6.11 / §4.6.4 to Step 2.5
    
    Phase 2 §RE-1 — first attempt at unskipping the four Phase-1-deferred
    distributed schema specs against the new cluster barrier:
    
    - §4.6.2 (clamp.go: "succeeds and returns zero elements when query spans
      schema CreatedAt"): NOW PASSES in distributed mode. The clamp + Query
      flow has no Write→Query baseline; the cluster barrier alone is enough
      to make this spec deterministic.
    - §6.8 (shape_break.go: "delete+apply new shape creates the new measure"),
      §6.11 (shape_break.go: "delete-then-recreate original shape drops old
      data"), §4.6.4 (clamp.go: "clips TimeRange.Begin to max(CreatedAt) and
      excludes pre-creation data") all STILL FLAKE. Each one's baseline step
      is a Create → AwaitRevision → Write → Query round-trip expecting the
      written datum back; the post-barrier query races the data-node write
      path independently of schema propagation. Empirically reproduced in a
      full distributed integration run: the schema barrier converges, then
      the very next Query returns 0 data points instead of 1.
    
      This matches the plan's §RE-1 forecast — these three specs are blocked
      on Step 2.5's cluster-wide query gate, not on the barrier itself. Per
      the plan: "for each that still flakes, leave the guard in place and
      add a one-sentence comment pointing to Step 2.5's cache-layer
      prerequisite." Updated each guard's comment to point at Step 2.5
      rather than "Phase 2.1–2.2 once those land."
    
    CHANGES.md: extend the Phase 2 sub-bullet under 0.11.0 to cover §SS-1..4
    (mid-call eviction / leave / late-join + NodeLaggard.reason proto field),
    §FA-1/FA-2 (AwaitSchemaApplied cluster fan-out), §FD-1/FD-2
    (AwaitSchemaDeleted cluster fan-out), and the §4.6.2 re-enable.
    
    via [HAPI](https://hapi.run)
---
 CHANGES.md                       |  7 +++++--
 test/cases/schema/clamp.go       | 24 +++++-------------------
 test/cases/schema/shape_break.go | 23 +++++++++--------------
 3 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index daad0e5da..e426b742e 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -31,10 +31,13 @@ Release Notes.
   - Add tombstone retention/GC (default 7 days, configurable via `--schema-server-tombstone-retention`) with a per-cache count cap to bound memory under bulk deletes.
   - Reject `Create` with `updated_at <= tombstone.delete_time` to prevent replayed creates from overwriting newer deletes.
   - Guard `pkg/schema/cache` against out-of-order `EventDelete` events; expose monotonic `LatestModRevision` watermark.
-- Schema consistency (Phase 2 in progress): cluster-wide barrier groundwork. Internal-only; no client-facing surface impact yet.
-  - Add `NodeSchemaStatusService` (`GetMaxRevision`, `GetKeyRevisions`, `GetAbsentKeys`) registered on every cluster member that holds a schema cache, so peer liaisons and data nodes can be probed identically by the upcoming barrier fan-out (#1108).
+- Schema consistency (Phase 2 in progress): cluster-wide barrier. Internal-only; no client-facing surface impact yet.
+  - Add `NodeSchemaStatusService` (`GetMaxRevision`, `GetKeyRevisions`, `GetAbsentKeys`) registered on every cluster member that holds a schema cache, so peer liaisons and data nodes can be probed identically by the barrier fan-out (#1108).
   - Extend `queue.Client` with `NewNodeSchemaStatusClient(node)` so the barrier fan-out can borrow the existing tier1/tier2 connection pools instead of opening a parallel mesh (#1109).
   - `AwaitRevisionApplied` now fans out across the receiving liaison's frozen tier1 (peer-liaison) + tier2 (data-node) Active set, probing each member in parallel via `GetMaxRevision` with shared per-call deadline. Cross-version peers returning `codes.Unimplemented` are treated as ready so partial-upgrade clusters do not deadlock; transient RPC errors count as per-iteration laggards. Empty Active set fails fast with `codes.Unavailable`.
+  - Frozen-snapshot mid-call semantics: members that transition `Active → Evictable` during a call are dropped from subsequent probes and surfaced once as a `NodeLaggard{reason="evicted_during_poll"}`; members that disappear from the route table altogether are dropped silently; late joiners are excluded from the watched set until the next call. Adds `reason` field (5) to `NodeLaggard` proto.
+  - `AwaitSchemaApplied` and `AwaitSchemaDeleted` follow the same fan-out shape using `GetKeyRevisions` / `GetAbsentKeys` respectively, with per-node calls chunked at 1000 keys and a shared call-wide deadline (no equal-slice division across chunks). Per-node laggards carry the per-member `missing_keys` / `still_present_keys` they observed.
+  - Re-enable §4.6.2 distributed spec (no Write→Query baseline). §6.8, §6.11, §4.6.4 remain skipped pending Step 2.5's cluster query gate.
 
 ### Bug Fixes
 
diff --git a/test/cases/schema/clamp.go b/test/cases/schema/clamp.go
index 90bf30940..db21b8717 100644
--- a/test/cases/schema/clamp.go
+++ b/test/cases/schema/clamp.go
@@ -116,17 +116,6 @@ var _ = g.Describe("Schema time-range clamp", func() {
        // server clamps Begin forward to CreatedAt and the query executes successfully.
        // Since no data was written the response has zero elements but no error.
        g.It("succeeds and returns zero elements when query spans schema CreatedAt (§4.6.2)", func() {
-               // TODO(phase-2): Phase 1 AwaitRevisionApplied / AwaitSchemaApplied are liaison-only
-               // by design. Unlike §4.6.1 and §4.6.3 (both ends far in the past, where the clamp
-               // short-circuits at the liaison and never dispatches to data nodes), this spec uses
-               // End=now+1h so the clamped range is non-empty and the query is dispatched. In
-               // distributed mode the data node can lag the liaison's schema view at that moment,
-               // causing the dispatched query to fail with "group not found". Cluster-wide barrier
-               // semantics ship in Phase 2 via NodeSchemaStatusService + liaison fan-out (plan
-               // Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
-               if SharedContext.Mode == helpers.ModeDistributed {
-                       g.Skip("§4.6.2 requires cluster-wide propagation barrier (Phase 2)")
-               }
                groupName := fmt.Sprintf("clamp-span-%d", time.Now().UnixNano())
                streamName := "clamp_stream"
 
@@ -226,15 +215,12 @@ var _ = g.Describe("Schema time-range clamp", func() {
        // inside [Begin, End] and the datum would leak — proving the clamp is actually
        // applied rather than merely consistent with an already-in-range write.
        g.It("clips TimeRange.Begin to max(CreatedAt) and excludes pre-creation data (§4.6.4)", func() {
-               // TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design. This spec's
-               // baseline sanity check (Create → AwaitRevision → Write → Query expecting HaveLen(1))
-               // races the data-node tsTable readiness in distributed mode. The actual clamp
-               // falsification is sound; only the prerequisite Write→Query round-trip flakes.
-               // Cluster-wide barrier semantics ship in Phase 2 via NodeSchemaStatusService +
-               // liaison fan-out (plan Steps 2.1–2.2); re-enable this spec in distributed mode
-               // once those land.
+               // Phase 2.2 barrier ensures schema is on every node, but this spec's
+               // baseline sanity step (Create → Write → Query expecting 1 datum)
+               // still races the data-node write path in distributed mode. Re-enable
+               // once Step 2.5 (cluster query gate) lands.
                if SharedContext.Mode == helpers.ModeDistributed {
-                       g.Skip("§4.6.4 requires cluster-wide propagation barrier (Phase 2)")
+                       g.Skip("§4.6.4 requires the cluster-wide query gate (Phase 2 Step 2.5)")
                }
                group1 := fmt.Sprintf("clamp-leak1-%d", time.Now().UnixNano())
                group2 := fmt.Sprintf("clamp-leak2-%d", time.Now().UnixNano())
diff --git a/test/cases/schema/shape_break.go b/test/cases/schema/shape_break.go
index 192bad7ab..11693fee0 100644
--- a/test/cases/schema/shape_break.go
+++ b/test/cases/schema/shape_break.go
@@ -160,14 +160,12 @@ var _ = g.Describe("Schema shape-break rejection", func() {
 
        // §6.8: shape-break — delete+apply new shape creates the new measure (Rule 7 clamp end-to-end).
        g.It("shape-break: delete+apply new shape creates the new measure (§6.8)", func() {
-               // TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design. Both this spec
-               // and §6.11 perform an end-to-end data Write+Query round-trip through the liaison after
-               // a schema mutation; in distributed mode the data node can lag the liaison briefly on
-               // tsTable readiness or query-side index refresh, racing the immediate Write→Query.
-               // Cluster-wide barrier semantics ship in Phase 2 via NodeSchemaStatusService + liaison
-               // fan-out (plan Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
+               // Phase 2.2 cluster barrier confirms schema propagation across nodes,
+               // but this spec's Write→Query baseline races the data-node write path
+               // independently of the schema barrier. Re-enable in distributed mode
+               // once Step 2.5 (cluster query gate) lands.
                if SharedContext.Mode == helpers.ModeDistributed {
-                       g.Skip("§6.8 requires cluster-wide propagation barrier (Phase 2)")
+                       g.Skip("§6.8 requires the cluster-wide query gate (Phase 2 Step 2.5)")
                }
                groupName := fmt.Sprintf("sb-new-%d", time.Now().UnixNano())
                measureName := "throughput"
@@ -410,14 +408,11 @@ var _ = g.Describe("Schema shape-break rejection", func() {
 
        // §6.11: delete-then-recreate original shape drops old data (Rule 7 clamp).
        g.It("delete-then-recreate original shape drops old data (§6.11)", func() {
-               // TODO(phase-2): Phase 1 AwaitRevisionApplied is liaison-only by design — it confirms
-               // the liaison cache observes R2 but not that every data node has rebuilt the tsTable
-               // after the delete-then-recreate. The post-recreate Write+Query round-trip exercised
-               // by this spec races the data-node tsTable rebuild on slow CI runners. Cluster-wide
-               // barrier semantics ship in Phase 2 via NodeSchemaStatusService + liaison fan-out
-               // (plan Steps 2.1–2.2); re-enable this spec in distributed mode once those land.
+               // Same write→query race as §6.8 — Phase 2.2's barrier ensures schema
+               // coherence but the post-recreate Write→Query baseline still flakes
+               // without the cluster query gate. Re-enable once Step 2.5 lands.
                if SharedContext.Mode == helpers.ModeDistributed {
-                       g.Skip("§6.11 requires cluster-wide propagation barrier (Phase 2)")
+                       g.Skip("§6.11 requires the cluster-wide query gate (Phase 2 Step 2.5)")
                }
                groupName := fmt.Sprintf("sb-same-%d", time.Now().UnixNano())
                measureName := "throughput"
