J-HowHuang opened a new pull request, #17138:
URL: https://github.com/apache/pinot/pull/17138
## Description

The greedy algorithm that re-assigns segments according to their current assignment was introduced in this PR: https://github.com/apache/pinot/pull/16135

The algorithm didn't account for the following scenario:

```
3 servers, 3 replicas:

+----------+----------+----------+
| Server 1 | Server 2 | Server 3 |
+----------+----------+----------+
| seg_1    | seg_1    | seg_1    |
| seg_2    | seg_2    | seg_2    |
| seg_3    | seg_3    | seg_3    |
| seg_4    | seg_4    | seg_4    |
| seg_5    | seg_5    | seg_5    |
| seg_6    | seg_6    | seg_6    |
+----------+----------+----------+
```

Now the replication factor is reduced to 2. For each of segments 1-6, the algorithm picks the first two servers from the current-assignment ideal state in its default order (I'm not sure whether that order is alphabetical, but I assume so), skipping any server that has already been assigned 4 segments (6 segments * 2 replicas / 3 servers). The first two servers in the default order are Server 1 and Server 2, and that order is identical across all segments, so the algorithm initially assigns the segments like this:

```
+----------+----------+----------+
| Server 1 | Server 2 | Server 3 |
+----------+----------+----------+
| seg_1    | seg_1    |          |
| seg_2    | seg_2    |          |
| seg_3    | seg_3    |          |
| seg_4    | seg_4    |          |
|          |          |          |
|          |          |          |
+----------+----------+----------+
```

Then, for segments 5 and 6, because the first two servers in the default order already hold 4 segments each, the segments are added to the third server in the list:

```
+----------+----------+----------+
| Server 1 | Server 2 | Server 3 |
+----------+----------+----------+
| seg_1    | seg_1    | seg_5    |
| seg_2    | seg_2    | seg_6    |
| seg_3    | seg_3    |          |
| seg_4    | seg_4    |          |
|          |          |          |
|          |          |          |
+----------+----------+----------+
```

Now one replica of each of segments 5 and 6 is still unassigned, and they cannot be placed on Server 3 again. The algorithm therefore assigns them to the servers hosting the fewest segments:

```
+----------+----------+----------+
| Server 1 | Server 2 | Server 3 |
+----------+----------+----------+
| seg_1    | seg_1    | seg_5    |
| seg_2    | seg_2    | seg_6    |
| seg_3    | seg_3    |          |
| seg_4    | seg_4    |          |
| seg_5    | seg_6    |          |
|          |          |          |
+----------+----------+----------+
```

This is why the following rebalance summary was seen:

```
{
  "numServersGettingNewSegments": 0,
  "numServers": {
    "valueBeforeRebalance": 3,
    "expectedValueAfterRebalance": 3
  },
  "serversAdded": [],
  "serversRemoved": [],
  "serversUnchanged": [
    "Server_pinot-server-server-0-5",
    "Server_pinot-server-server-0-4",
    "Server_pinot-server-server-0-6"
  ],
  "serversGettingNewSegments": [],
  "serverSegmentChangeInfo": {
    "Server_pinot-server-server-0-5": {
      "serverStatus": "UNCHANGED",
      "totalSegmentsAfterRebalance": 1025,  <----------------- uneven
      "totalSegmentsBeforeRebalance": 1230,
      "segmentsAdded": 0,
      "segmentsDeleted": 205,
      "segmentsUnchanged": 1025,
      "tagList": [
        "DefaultTenant_REALTIME",
        "DefaultTenant_OFFLINE"
      ]
    },
    "Server_pinot-server-server-0-4": {
      "serverStatus": "UNCHANGED",
      "totalSegmentsAfterRebalance": 1025,  <----------------- uneven
      "totalSegmentsBeforeRebalance": 1230,
      "segmentsAdded": 0,
      "segmentsDeleted": 205,
      "segmentsUnchanged": 1025,
      "tagList": [
        "DefaultTenant_OFFLINE",
        "DefaultTenant_REALTIME"
      ]
    },
    "Server_pinot-server-server-0-6": {
      "serverStatus": "UNCHANGED",
      "totalSegmentsAfterRebalance": 410,  <----------------- uneven
      "totalSegmentsBeforeRebalance": 1230,
      "segmentsAdded": 0,
      "segmentsDeleted": 820,
      "segmentsUnchanged": 410,
      "tagList": [
        "DefaultTenant_OFFLINE",
        "DefaultTenant_REALTIME"
      ]
    }
  }
}
```

```
{
  "totalSegmentsToBeMoved": 0,
  "totalSegmentsToBeDeleted": 1230,
  "maxSegmentsAddedToASingleServer": 0,
  "estimatedAverageSegmentSizeInBytes": 1926625,
  "totalEstimatedDataToBeMovedInBytes": 0,
  "replicationFactor": {
    "valueBeforeRebalance": 3,
    "expectedValueAfterRebalance": 2  <----------------- we reduce the RF here
  },
  "numSegmentsInSingleReplica": {
    "valueBeforeRebalance": 1230,
    "expectedValueAfterRebalance": 1230
  },
  "numSegmentsAcrossAllReplicas": {
    "valueBeforeRebalance": 3690,
    "expectedValueAfterRebalance": 2460
  }
}
```
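To make the failure mode concrete, here is a minimal, runnable sketch of the greedy pick described above. This is not the actual Pinot code; the class, variable names, and target computation are simplified assumptions made for illustration.

```java
import java.util.*;

// Hypothetical, simplified model of the flawed greedy pick described above.
// It is NOT the actual Pinot implementation.
public class GreedyRepro {
  public static void main(String[] args) {
    List<String> servers = List.of("server1", "server2", "server3");
    int numSegments = 6;
    int targetReplication = 2;
    // 6 segments * 2 replicas / 3 servers = 4 segments per server
    int target = numSegments * targetReplication / servers.size();

    Map<String, Integer> load = new LinkedHashMap<>();
    servers.forEach(s -> load.put(s, 0));
    Map<String, List<String>> assignment = new LinkedHashMap<>();

    for (int i = 1; i <= numSegments; i++) {
      List<String> picked = new ArrayList<>();
      // Flawed pass: walk the candidates in the SAME default order for
      // every segment, skipping servers that already reached the target.
      for (String server : servers) {
        if (picked.size() == targetReplication) {
          break;
        }
        if (load.get(server) < target) {
          picked.add(server);
          load.merge(server, 1, Integer::sum);
        }
      }
      // Fallback pass: put any replica still unassigned on the least
      // loaded server that doesn't already host this segment.
      while (picked.size() < targetReplication) {
        String leastLoaded = servers.stream()
            .filter(s -> !picked.contains(s))
            .min(Comparator.comparingInt(load::get))
            .orElseThrow();
        picked.add(leastLoaded);
        load.merge(leastLoaded, 1, Integer::sum);
      }
      assignment.put("seg_" + i, picked);
    }

    System.out.println(assignment);
    System.out.println(load);  // {server1=5, server2=5, server3=2}: the same
                               // skew as the 1025/1025/410 summary above
  }
}
```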
## Change

The algorithm is modified to first sort the candidate servers by their currently assigned segment count, then use that order when assigning the segment, so that servers hosting fewer segments get priority (see the sketch after the Test section below).

## Test

Added a unit test covering the scenario above. The original algorithm would have failed this test.
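For illustration, here is the same hypothetical model as above with only the ordering change applied; it omits the real algorithm's other constraints, and the names remain made up:

```java
import java.util.*;

// Same hypothetical model with the fix applied: the candidates are sorted
// by their current segment count before each pick (ties keep the default
// order because List.sort is stable), so lightly loaded servers get
// priority and the load converges instead of skewing.
public class GreedyFixed {
  public static void main(String[] args) {
    List<String> servers = List.of("server1", "server2", "server3");
    int numSegments = 6;
    int targetReplication = 2;

    Map<String, Integer> load = new LinkedHashMap<>();
    servers.forEach(s -> load.put(s, 0));
    Map<String, List<String>> assignment = new LinkedHashMap<>();

    for (int i = 1; i <= numSegments; i++) {
      List<String> candidates = new ArrayList<>(servers);
      candidates.sort(Comparator.comparingInt(load::get));  // fewest segments first
      List<String> picked = new ArrayList<>(candidates.subList(0, targetReplication));
      picked.forEach(s -> load.merge(s, 1, Integer::sum));
      assignment.put("seg_" + i, picked);
    }

    System.out.println(assignment);
    System.out.println(load);  // {server1=4, server2=4, server3=4}: balanced
  }
}
```

In this toy run, seg_1 still lands on server1 and server2, but server3 is among the two least-loaded candidates for seg_2, and every server ends at exactly 4 segments.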
