hanahmily opened a new pull request, #1132:
URL: https://github.com/apache/skywalking-banyandb/pull/1132

   ## Summary
   
   A production GKE cluster running BanyanDB returns duplicate rows for 
`IndexMode=true` measures: the same `Sid` (series ID) appears two or more times 
with different timestamps. Investigation showed both `data-hot-0` and 
`data-hot-1` materialize bluge segment files 
(`/tmp/measure/data/sw_metadata/seg-*/sidx/*.seg`) for the same logical shard, 
even though the group is configured `shardNum=2, replicas=0`. Example query 
result (`zipkin_service_traffic_minute` in group `sw_metadata`):
   
   ```
   id=app.sample-services       sid=5368413324821367713   
ts=2026-03-17T01:47:00Z
   id=app.sample-services       sid=5368413324821367713   
ts=2026-04-11T06:31:00Z
   id=gateway.sample-services   sid=14768591501448641370  
ts=2026-03-17T01:47:00Z
   id=gateway.sample-services   sid=14768591501448641370  
ts=2026-05-07T03:48:00Z
   …(every service repeats)
   ```
   
   ### Cause chain
   
   1. **Write-side split-brain** — when a liaison evicts a data node from its 
round-robin selector on a transient gRPC `DeadlineExceeded`, `selectNode(i, 
replicaID) = nodes[(i+replicaID) % len(nodes)]` 
(`pkg/node/round_robin.go:219-222`) remaps every shardID to the surviving node. 
A peer liaison may still route the original shardID to the evicted node, so the 
same shard's writes land on **two** data nodes. Each node's `IndexDB().Update` 
(`pkg/index/inverted/inverted_series.go:65-86`) upserts only 
per-segment-per-node, with no cross-node coordination. The write-side window 
has been closed by #1130 (`af5970f0` — `fix(queue/pub): keep pub-node-probe 
alive after caller ctx cancel`).
   
   2. **Read-side dedup gap** — `sortedMIterator.loadDps` in 
`pkg/query/logical/measure/measure_plan_distributed.go` dedupes by
      ```go
      hashDataPoint = fnv(Sid, Timestamp.Seconds, Timestamp.Nanos)
      ```
      only **within rows that share `SortedField()`**. Cross-node duplicates 
carry different per-node "last-write" timestamps, so when the result is sorted 
by time they land in different sort-field groups — the dedup window has already 
closed (`else break`) by the time the second row arrives. Even within one 
group, the hash key differs because `Timestamp` is part of it.
   
      For an `IndexMode=true` measure, the doc's `Timestamp` is bookkeeping 
(which write was last on a given node), not identity. Dedup by `Sid` alone is 
the right contract.
   
   ### Fix
   
   Thread the schema's `IndexMode` flag from `unresolvedDistributed.Analyze` 
through `distributedPlan` into `sortedMIterator`. When set:
   
   - Maintain a `seenSids map[uint64]struct{}` populated across **all** 
`loadOneGroup` invocations (mirroring the per-node `seriesFilter` at 
`banyand/measure/query.go:556-575`).
   - Skip rows whose `Sid` was already emitted in a prior group; otherwise add 
`Sid` to the set and push.
   - Split `loadDps` into a wrapper + `loadOneGroup` so a fully-filtered 
intermediate group advances the loop instead of terminating iteration.
   
   Time-series measures (`IndexMode=false`) keep the existing per-group dedup 
bit-for-bit — `seenSids` is `nil`, the only consumer is gated.
   
   ### Why now, not at write time
   
   The write-side fix (#1130) prevents **new** divergence. Already-divergent 
on-disk state will age out via the group's TTL but is observable to clients 
until then. The read-side fix bounds the user-visible duplicate window to 
"until next segment rotation" regardless of disk state, and is durable against 
any future write-path race.
   
   ### Out of scope
   
   - Cluster upgrade / image deployment (operational).
   - On-disk cleanup of existing duplicate docs (ages out via 60d TTL).
   - `pkg/node/round_robin.go` selector semantics (separate analysis; see #1130 
for the related write-side fix).
   - Vec query path — row-path is being deprecated in favor of vec (see `// 
Deprecated:` at `pkg/query/logical/measure/measure_plan_distributed.go:105`); a 
vec-distributed equivalent is a G8 follow-up.
   
   ## Test plan
   
   - [x] `go build ./...`
   - [x] `go test -count=1 ./pkg/query/logical/measure/...` (6 pre-existing + 5 
new IndexMode dedup cases, all pass)
   - [x] `bin/golangci-lint run ./pkg/query/logical/measure/...` (clean)
   - [ ] Verify on the affected cluster: after deploying both #1130 and this 
PR, query `zipkin_service_traffic_minute` and confirm each `id` returns exactly 
one row.
   
   ### Test coverage added
   
   `TestSortedMIterator_IndexModeDedup` covers:
   
   1. `IndexMode=true`, same `Sid` across different sort-field groups → 
collapses to one (the user-reported bug).
   2. `IndexMode=false`, same input → both rows preserved (regression guard).
   3. `IndexMode=true`, same `Sid` within one sort-field group → collapses to 
one.
   4. `IndexMode=true`, three Sids interleaved across two groups → distinct 
Sids preserved.
   5. `IndexMode=true`, fully-filtered intermediate group → outer loop advances 
to next productive group instead of terminating.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code) via 
[HAPI](https://hapi.run)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to