hanahmily opened a new pull request, #1132:
URL: https://github.com/apache/skywalking-banyandb/pull/1132
## Summary
A production GKE cluster running BanyanDB returns duplicate rows for
`IndexMode=true` measures: the same `Sid` (series ID) appears two or more times
with different timestamps. Investigation showed both `data-hot-0` and
`data-hot-1` materialize bluge segment files
(`/tmp/measure/data/sw_metadata/seg-*/sidx/*.seg`) for the same logical shard,
even though the group is configured `shardNum=2, replicas=0`. Example query
result (`zipkin_service_traffic_minute` in group `sw_metadata`):
```
id=app.sample-services sid=5368413324821367713
ts=2026-03-17T01:47:00Z
id=app.sample-services sid=5368413324821367713
ts=2026-04-11T06:31:00Z
id=gateway.sample-services sid=14768591501448641370
ts=2026-03-17T01:47:00Z
id=gateway.sample-services sid=14768591501448641370
ts=2026-05-07T03:48:00Z
…(every service repeats)
```
### Cause chain
1. **Write-side split-brain** — when a liaison evicts a data node from its
round-robin selector on a transient gRPC `DeadlineExceeded`, `selectNode(i,
replicaID) = nodes[(i+replicaID) % len(nodes)]`
(`pkg/node/round_robin.go:219-222`) remaps every shardID to the surviving node.
A peer liaison may still route the original shardID to the evicted node, so the
same shard's writes land on **two** data nodes. Each node's `IndexDB().Update`
(`pkg/index/inverted/inverted_series.go:65-86`) upserts only
per-segment-per-node, with no cross-node coordination. The write-side window
has been closed by #1130 (`af5970f0` — `fix(queue/pub): keep pub-node-probe
alive after caller ctx cancel`).
2. **Read-side dedup gap** — `sortedMIterator.loadDps` in
`pkg/query/logical/measure/measure_plan_distributed.go` dedupes by
```go
hashDataPoint = fnv(Sid, Timestamp.Seconds, Timestamp.Nanos)
```
only **within rows that share `SortedField()`**. Cross-node duplicates
carry different per-node "last-write" timestamps, so when the result is sorted
by time they land in different sort-field groups — the dedup window has already
closed (`else break`) by the time the second row arrives. Even within one
group, the hash key differs because `Timestamp` is part of it.
For an `IndexMode=true` measure, the doc's `Timestamp` is bookkeeping
(which write was last on a given node), not identity. Dedup by `Sid` alone is
the right contract.
### Fix
Thread the schema's `IndexMode` flag from `unresolvedDistributed.Analyze`
through `distributedPlan` into `sortedMIterator`. When set:
- Maintain a `seenSids map[uint64]struct{}` populated across **all**
`loadOneGroup` invocations (mirroring the per-node `seriesFilter` at
`banyand/measure/query.go:556-575`).
- Skip rows whose `Sid` was already emitted in a prior group; otherwise add
`Sid` to the set and push.
- Split `loadDps` into a wrapper + `loadOneGroup` so a fully-filtered
intermediate group advances the loop instead of terminating iteration.
Time-series measures (`IndexMode=false`) keep the existing per-group dedup
bit-for-bit — `seenSids` is `nil`, the only consumer is gated.
### Why now, not at write time
The write-side fix (#1130) prevents **new** divergence. Already-divergent
on-disk state will age out via the group's TTL but is observable to clients
until then. The read-side fix bounds the user-visible duplicate window to
"until next segment rotation" regardless of disk state, and is durable against
any future write-path race.
### Out of scope
- Cluster upgrade / image deployment (operational).
- On-disk cleanup of existing duplicate docs (ages out via 60d TTL).
- `pkg/node/round_robin.go` selector semantics (separate analysis; see #1130
for the related write-side fix).
- Vec query path — row-path is being deprecated in favor of vec (see `//
Deprecated:` at `pkg/query/logical/measure/measure_plan_distributed.go:105`); a
vec-distributed equivalent is a G8 follow-up.
## Test plan
- [x] `go build ./...`
- [x] `go test -count=1 ./pkg/query/logical/measure/...` (6 pre-existing + 5
new IndexMode dedup cases, all pass)
- [x] `bin/golangci-lint run ./pkg/query/logical/measure/...` (clean)
- [ ] Verify on the affected cluster: after deploying both #1130 and this
PR, query `zipkin_service_traffic_minute` and confirm each `id` returns exactly
one row.
### Test coverage added
`TestSortedMIterator_IndexModeDedup` covers:
1. `IndexMode=true`, same `Sid` across different sort-field groups →
collapses to one (the user-reported bug).
2. `IndexMode=false`, same input → both rows preserved (regression guard).
3. `IndexMode=true`, same `Sid` within one sort-field group → collapses to
one.
4. `IndexMode=true`, three Sids interleaved across two groups → distinct
Sids preserved.
5. `IndexMode=true`, fully-filtered intermediate group → outer loop advances
to next productive group instead of terminating.
🤖 Generated with [Claude Code](https://claude.com/claude-code) via
[HAPI](https://hapi.run)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]