hanahmily opened a new pull request, #1112:
URL: https://github.com/apache/skywalking-banyandb/pull/1112

   > **DRAFT — for CP-5 contract review only.** Per the workflow, this PR will 
not be marked ready until CP-5 reviewers (dev lead + SRE) sign off the 
cluster-barrier contract. Two commits at the bottom (`3588e39c`, `3f7a6c75`) 
are also in the in-flight per-step PR #1111; merging is blocked on #1111 
anyway. Opening this PR is purely to surface the cumulative Steps 2.2 + 2.3 + 
2.4 surface for Copilot + human review in one place.
   
   ### Copilot, please focus this review on:
   
   **1. Bugs and correctness issues** in the cluster-barrier fan-out:
   - `banyand/liaison/grpc/barrier_cluster.go` — frozen-snapshot membership 
semantics (`applyTransitions` / `applyTransitionsApplied`), per-iteration probe 
loops, deadline propagation, cross-version `Unimplemented → ready` handling, 
transient-error handling.
   - The three `awaitXxxCluster` paths (`awaitRevisionAppliedCluster`, 
`awaitSchemaAppliedCluster`, `awaitSchemaDeletedCluster`) — race conditions, 
off-by-one in chunking, missed laggard reports, premature applied=true returns.
   - `evictedLaggardReason` handling: is the laggard surfaced exactly once? 
Does `lastRev` carry the correct mod_revision into the laggard's 
`current_mod_revision`?
   - Self-vs-peer dispatch in `probeOne` / `probeKeysOne` / `probeAbsentOne`: 
fail-closed on nil cache, correct tier client lookup for `roleData` vs 
`roleLiaison`.
   - The `NodeLaggard.reason` proto field is additive (field 5); is the regen 
consistent across `barrier.pb.go` and the docs?
   - `test/cases/schema/{shape_break,clamp}.go`: the §4.6.2 unblock and the 
§6.8/§6.11/§4.6.4 re-skip — are the comments accurate about what races vs. what 
the cluster barrier covers?
   
   **2. Performance concerns** under realistic load:
   - HTTP/2 stream contention: every per-member probe rides on `pub`'s existing 
tier1/tier2 `*grpc.ClientConn`. Could the per-iteration probe burst (≤10 polls 
× N peers in a 5s budget) starve concurrent write/query traffic on the same 
connection?
   - Goroutine cost per iteration: `probeMembers` / `probeKeyRevisions` / 
`probeAbsentKeys` spawn one goroutine per member per iteration. On a 50-node 
cluster running, say, 100 concurrent barrier calls, that's 5000 goroutines per 
iteration. Bounded? Pool-able? Backpressure?
   - Chunking memory: `barrierKeyChunkSize=1000`, `barrierMaxKeys=10000` means 
up to 10 chunks per peer per iteration. The accumulator slice (`missing`, 
`present`, `stillPresent`) reallocates on append — is preallocation worth it 
given chunk count is small?
   - `currentMembership()` runs every iteration and calls `GetRouteTable()` on 
both tiers, which clones the route table internally. At ≤10 polls per call, 
this is ≤20 GetRouteTable calls per call. Acceptable, but could be cached.
   - Backoff cadence: 10ms → ×1.5 → cap 500ms gives ≤10 probes per 5s budget 
per member. The §5.0 sequence diagram math holds; flagging just in case Copilot 
sees a tighter pattern.
   
   ### What's in the branch
   
   | Commit | Step | Acceptance |
   |---|---|---|
   | `3588e39c` | Step 2.2 fan-out (`AwaitRevisionApplied`) | A11, A15 cluster |
   | `3f7a6c75` | #1111 fix: `selfName` closure nil-safety | (Copilot review 
tweak) |
   | `87d2723c` | §SS-1..§SS-4: `NodeLaggard.reason` proto + mid-call eviction 
/ leave / late-join | (frozen-snapshot semantics for Steps 2.2 / 2.3 / 2.4) |
   | `1d789975` | §FA-1, §FA-2: `AwaitSchemaApplied` cluster fan-out + chunking 
+ shared deadline | A12 cluster |
   | `6731df5c` | §FD-1, §FD-2: `AwaitSchemaDeleted` cluster fan-out | A13 
cluster |
   | `47006561` | §RE-1: re-enable §4.6.2 in distributed; defer 
§6.8/§6.11/§4.6.4 to Step 2.5 | (test re-enable; first attempt per plan) |
   
   ### What's deferred (NOT in this PR)
   
   - §6.12a/b/c/d distributed integration specs (depend on `PauseDataNodeWatch` 
from Step 1.0; not implemented).
   - Step 2.5 cluster query gate; Step 2.7 metrics; Step 2.8 verification 
harness.
   
   ### Local CI status
   
   All phases green on this branch HEAD (`47006561`):
   - `make license-check / check-req / build / lint / check`
   - `make test-ci PKG=./banyand/...` (29 cluster-fan-out tests + Phase 1 
helpers green)
   - `make test-ci PKG=./bydbctl/...`, `./pkg/...`, `./fodc/...` (one fodc 
Prometheus port-2121 flake recovered on retry; unrelated to this branch)
   - `make test-ci PKG=./test/integration/standalone/...` and 
`./test/integration/distributed/...` (§4.6.2 now passes in distributed mode)
   
   ### Checklist (per `.github/PULL_REQUEST_TEMPLATE`)
   
   - [x] Non-trivial feature; design lives in 
`.omc/plans/banyandb-schema-tbd-plan.md` Step 2.2 / 2.3 / 2.4 (worktree-local).
   - [x] Documentation: function-level comments in `barrier_cluster.go`; 
`NodeLaggard.reason` field documented in proto; `docs/api-reference.md` 
regenerated.
   - [x] Tests: 14 new cluster-fan-out unit tests; existing 19 barrier tests 
stay green; distributed integration green with §4.6.2 re-enabled.
   - [ ] UI-related — N/A.
   - [ ] Closes/resolves/fixes existing issue — N/A; tracking continues under 
#1091's umbrella.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to