xiangfu0 opened a new pull request, #17403:
URL: https://github.com/apache/pinot/pull/17403

   # Observation
   Repeated `curl` calls to `/debug/routingTable/<table>` always return the same 
subset of servers, even though regular queries are evenly distributed.
   
   The table uses `strictReplicaGroup` as its routing strategy and has 2 
replicas.
   
   # Root cause 
   When a table uses `strictReplicaGroup`, the broker chooses a single 
replica-group based on the requestId (e.g. `instanceIdx = requestId % 
numCandidates`, where `numCandidates` is 2 for this table). This is how Pinot 
rotates across replica-groups.
   
   However, the broker debug endpoint `/debug/routingTable/{tableName}` was 
generating a new requestId for each table type it tried (`OFFLINE`, then 
`REALTIME`). For a realtime-only table called via the raw name (no `_REALTIME` 
suffix), the OFFLINE routing call returns null but still consumes the first 
requestId. That means the REALTIME routing calculation always sees requestId 
values spaced by 2 (1, 3, 5, …). For an even number of replica-groups, such 
values can only ever reach half of the replica-group indices; with the most 
common setup of 2 replica-groups, they always map to the same index.
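   The skew described above can be reproduced with a small standalone sketch. 
It is not Pinot code: the class `RoutingSkewDemo`, its methods, and the counter 
starting at 0 are hypothetical stand-ins, and the replica-group choice is 
reduced to the `requestId % numCandidates` rule quoted earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RoutingSkewDemo {
    // Simplified replica-group selection: requestId % numCandidates.
    static int pickReplicaGroup(long requestId, int numCandidates) {
        return (int) (requestId % numCandidates);
    }

    // Models repeated debug calls for a realtime-only table BEFORE the fix:
    // every call draws two ids, one for OFFLINE and one for REALTIME.
    static List<Integer> simulateBeforeFix(int calls, int numReplicaGroups) {
        AtomicLong idGenerator = new AtomicLong(); // assumed to start at 0
        List<Integer> picks = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            idGenerator.getAndIncrement();                   // OFFLINE attempt: no routing, id is consumed anyway
            long realtimeId = idGenerator.getAndIncrement(); // REALTIME sees 1, 3, 5, ...
            picks.add(pickReplicaGroup(realtimeId, numReplicaGroups));
        }
        return picks;
    }

    public static void main(String[] args) {
        System.out.println(simulateBeforeFix(4, 2)); // prints [1, 1, 1, 1]
        System.out.println(simulateBeforeFix(4, 3)); // prints [1, 0, 2, 1]
    }
}
```

With 2 replica-groups every call lands on index 1, while with 3 the odd ids 
still rotate through all indices, which is why the problem only shows up for 
even replica-group counts.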
   
   
   This PR fixes the skew by generating a single requestId per 
`/debug/routingTable` request and reusing it for both `OFFLINE` and `REALTIME` 
routing computations.
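
   A minimal sketch of the single-requestId approach, again using hypothetical 
names (`SingleRequestIdDemo`, `simulateAfterFix`) rather than the actual broker 
classes, and assuming a counter that starts at 0:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SingleRequestIdDemo {
    // Simplified replica-group selection: requestId % numCandidates.
    static int pickReplicaGroup(long requestId, int numCandidates) {
        return (int) (requestId % numCandidates);
    }

    // Models repeated debug calls for a realtime-only table AFTER the fix:
    // one id is drawn per call and shared by both table types.
    static List<Integer> simulateAfterFix(int calls, int numReplicaGroups) {
        AtomicLong idGenerator = new AtomicLong(); // assumed to start at 0
        List<Integer> picks = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            long requestId = idGenerator.getAndIncrement(); // single id per debug request
            // The OFFLINE lookup would receive requestId too; it yields no routing here.
            picks.add(pickReplicaGroup(requestId, numReplicaGroups)); // REALTIME reuses the same id
        }
        return picks;
    }

    public static void main(String[] args) {
        System.out.println(simulateAfterFix(4, 2)); // prints [0, 1, 0, 1]
    }
}
```

Because ids now advance by 1 per call, consecutive debug calls alternate 
between the two replica-groups.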
   
   # Tests
   
   Adds `PinotBrokerDebugTest` to verify:
   - OFFLINE + REALTIME routing in one call use the same requestId
   - realtime-only raw table calls don’t “skew” the REALTIME requestId 
(REALTIME ids advance by 1 per call, not 2)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

