xiangfu0 opened a new pull request, #17403:
URL: https://github.com/apache/pinot/pull/17403
# Observation
Repeated `curl` calls to `/debug/routingTable/<table>` always return the same
subset of servers, even though queries can be evenly distributed.
The table uses `strictReplicaGroup` as its routing strategy and has 2
replicas.
# Root cause
When a table uses `strictReplicaGroup`, the broker chooses a single
replica-group based on the requestId (e.g. `instanceIdx = requestId % numCandidates`,
where `numCandidates` is 2 for this table). This is how Pinot rotates across
replica-groups.
However, the broker debug endpoint `/debug/routingTable/{tableName}` was
generating a new requestId for each table type it tried (`OFFLINE`, then
`REALTIME`). For a realtime-only table called via the raw name (no `_REALTIME`
suffix), the OFFLINE routing call returns null but still consumes the first
requestId. As a result, the REALTIME routing calculation always sees requestId
values spaced by 2 (1, 3, 5, …), which, with 2 replica-groups (the most common
setup), always map to the same replica-group index. More generally, with any
even number of replica-groups, a stride of 2 only ever reaches half of them.
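
The arithmetic can be reproduced with a small standalone simulation. This is a sketch in Python, not Pinot code: `old_realtime_request_ids` and `replica_group_indexes` are hypothetical helpers, the counter is only a stand-in for the broker's requestId generator, and its starting value of 0 is an assumption chosen so the REALTIME ids match the 1, 3, 5, … sequence described above.

```python
import itertools


def old_realtime_request_ids(num_calls, start=0):
    """Model the pre-fix endpoint: each call draws one requestId for the
    OFFLINE attempt and a second one for the REALTIME attempt."""
    counter = itertools.count(start)  # hypothetical stand-in for the broker's id generator
    realtime_ids = []
    for _ in range(num_calls):
        next(counter)                       # OFFLINE attempt: returns null, but the id is spent
        realtime_ids.append(next(counter))  # REALTIME attempt gets the following id
    return realtime_ids


def replica_group_indexes(request_ids, num_candidates):
    """Apply the rotation rule quoted above: requestId % numCandidates."""
    return [request_id % num_candidates for request_id in request_ids]


ids = old_realtime_request_ids(5)
print(ids)                            # [1, 3, 5, 7, 9]
print(replica_group_indexes(ids, 2))  # [1, 1, 1, 1, 1]
print(replica_group_indexes(ids, 3))  # [1, 0, 2, 1, 0]
```

With 2 candidates every call lands on index 1, while the same stride still visits every index when the number of candidates is odd, which is why the skew only shows up for even replica-group counts.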
This PR fixes the skew by generating a single requestId per
`/debug/routingTable` request and reusing it for both `OFFLINE` and `REALTIME`
routing computations.
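
A minimal sketch of the intended behavior after the fix, again as a Python simulation and not the actual broker code: `fixed_debug_routing` is a hypothetical function, the counter stands in for the broker's requestId generator, and the starting value of 0 is assumed.

```python
import itertools


def fixed_debug_routing(counter, num_candidates):
    """Model the post-fix endpoint: one requestId per debug request,
    shared by the OFFLINE and REALTIME routing computations."""
    request_id = next(counter)
    offline_index = request_id % num_candidates   # may go unused if the table has no OFFLINE part
    realtime_index = request_id % num_candidates
    return request_id, offline_index, realtime_index


counter = itertools.count(0)  # hypothetical id generator, starting value assumed
calls = [fixed_debug_routing(counter, 2) for _ in range(4)]
realtime_indexes = [realtime for _, _, realtime in calls]
print(realtime_indexes)  # [0, 1, 0, 1]
```

Because each debug request now consumes exactly one id, consecutive calls alternate between the two replica-group indexes instead of sticking to one.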
# Tests
Adds `PinotBrokerDebugTest` to verify:
- OFFLINE + REALTIME routing in one call use the same requestId
- realtime-only raw table calls don’t “skew” REALTIME requestId (REALTIME ids
  advance by 1 per call, not 2)