J-HowHuang opened a new pull request, #17138:
URL: https://github.com/apache/pinot/pull/17138

   ## Description
   The greedy algorithm to re-assign segments according to their current 
assignment was introduced in this PR: https://github.com/apache/pinot/pull/16135
   
   The algorithm didn't account for the following scenario: 
   ```
   3 servers, 3 replicas:
   
   +----------+----------+----------+
   | Server 1 | Server 2 | Server 3 |
   +----------+----------+----------+
   |  seg_1   |  seg_1   |  seg_1   |
   |  seg_2   |  seg_2   |  seg_2   |
   |  seg_3   |  seg_3   |  seg_3   |
   |  seg_4   |  seg_4   |  seg_4   |
   |  seg_5   |  seg_5   |  seg_5   |
   |  seg_6   |  seg_6   |  seg_6   |
   +----------+----------+----------+
   ```
   
   Now the replication factor is reduced to 2. For each of segments 1-6, the algorithm chooses the first two servers in the default order (likely alphabetical) from the current ideal-state assignment, until a server reaches 4 segments (6 segments * 2 replicas / 3 servers). The default first two are Server 1 and Server 2, and this order is identical across all segments.
   
   Initially, the algorithm assigns the segments like this:
   
   ```
   +----------+----------+----------+
   | Server 1 | Server 2 | Server 3 |
   +----------+----------+----------+
   |  seg_1   |  seg_1   |          |
   |  seg_2   |  seg_2   |          |
   |  seg_3   |  seg_3   |          |
   |  seg_4   |  seg_4   |          |
   |          |          |          |
   |          |          |          |
   +----------+----------+----------+
   ```
   
   Then, for segments 5 and 6, because the first two servers in the default order already hold 4 segments each, they are assigned to the third server in the list:
   ```
   +----------+----------+----------+
   | Server 1 | Server 2 | Server 3 |
   +----------+----------+----------+
   |  seg_1   |  seg_1   |  seg_5   |
   |  seg_2   |  seg_2   |  seg_6   |
   |  seg_3   |  seg_3   |          |
   |  seg_4   |  seg_4   |          |
   |          |          |          |
   |          |          |          |
   +----------+----------+----------+
   ```
   Now one replica each of segments 5 and 6 remains unassigned, because they cannot be placed on Server 3 again. The algorithm then assigns them to the servers hosting the fewest segments:
   ```
   +----------+----------+----------+
   | Server 1 | Server 2 | Server 3 |
   +----------+----------+----------+
   |  seg_1   |  seg_1   |  seg_5   |
   |  seg_2   |  seg_2   |  seg_6   |
   |  seg_3   |  seg_3   |          |
   |  seg_4   |  seg_4   |          |
   |  seg_5   |  seg_6   |          |
   |          |          |          |
   +----------+----------+----------+
   ```
   
   This is why the following rebalance summary was seen:
   ```
   {
     "numServersGettingNewSegments": 0,
     "numServers": {
       "valueBeforeRebalance": 3,
       "expectedValueAfterRebalance": 3
     },
     "serversAdded": [],
     "serversRemoved": [],
     "serversUnchanged": [
       "Server_pinot-server-server-0-5",
       "Server_pinot-server-server-0-4",
       "Server_pinot-server-server-0-6"
     ],
     "serversGettingNewSegments": [],
     "serverSegmentChangeInfo": {
       "Server_pinot-server-server-0-5": {
         "serverStatus": "UNCHANGED",
         "totalSegmentsAfterRebalance": 1025, <----------------- uneven
         "totalSegmentsBeforeRebalance": 1230,
         "segmentsAdded": 0,
         "segmentsDeleted": 205,
         "segmentsUnchanged": 1025,
         "tagList": [
           "DefaultTenant_REALTIME",
           "DefaultTenant_OFFLINE"
         ]
       },
       "Server_pinot-server-server-0-4": {
         "serverStatus": "UNCHANGED",
         "totalSegmentsAfterRebalance": 1025, <----------------- uneven
         "totalSegmentsBeforeRebalance": 1230,
         "segmentsAdded": 0,
         "segmentsDeleted": 205,
         "segmentsUnchanged": 1025,
         "tagList": [
           "DefaultTenant_OFFLINE",
           "DefaultTenant_REALTIME"
         ]
       },
       "Server_pinot-server-server-0-6": {
         "serverStatus": "UNCHANGED",
         "totalSegmentsAfterRebalance": 410, <----------------- uneven
         "totalSegmentsBeforeRebalance": 1230,
         "segmentsAdded": 0,
         "segmentsDeleted": 820,
         "segmentsUnchanged": 410,
         "tagList": [
           "DefaultTenant_OFFLINE",
           "DefaultTenant_REALTIME"
         ]
       }
     }
   }
   
   {
     "totalSegmentsToBeMoved": 0,
     "totalSegmentsToBeDeleted": 1230,
     "maxSegmentsAddedToASingleServer": 0,
     "estimatedAverageSegmentSizeInBytes": 1926625,
     "totalEstimatedDataToBeMovedInBytes": 0,
     "replicationFactor": {
       "valueBeforeRebalance": 3,
    "expectedValueAfterRebalance": 2  <----------------- we reduce the RF here
     },
     "numSegmentsInSingleReplica": {
       "valueBeforeRebalance": 1230,
       "expectedValueAfterRebalance": 1230
     },
     "numSegmentsAcrossAllReplicas": {
       "valueBeforeRebalance": 3690,
       "expectedValueAfterRebalance": 2460
     }
   }
   ```
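   The reported counts match the 5:5:2 skew from the toy example scaled up to 1230 segments: two servers end up with 1230 * 5/6 = 1025 segments and one with 1230 * 2/6 = 410, instead of the even 1230 * 2 / 3 = 820. A quick arithmetic check:

   ```python
   total_segments = 1230
   replicas_after = 2
   num_servers = 3

   # Ideal even split after reducing RF to 2
   ideal_per_server = total_segments * replicas_after // num_servers  # 820

   # Actual counts in the summary follow the 5:5:2 pattern from the
   # toy example (6 segments -> 5, 5, 2 per server):
   actual = [total_segments * 5 // 6, total_segments * 5 // 6, total_segments * 2 // 6]
   assert actual == [1025, 1025, 410]
   # Totals still add up to all replicas, only the distribution is uneven
   assert sum(actual) == total_segments * replicas_after  # 2460
   ```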
   
   ## Change
   The algorithm is modified to first sort the candidate servers by their currently assigned segment count, then assign each segment in that order, giving priority to the servers with fewer segments.
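   A minimal sketch of the idea, using the same toy model as above (this is an illustration of the load-aware ordering only; it ignores the minimal-movement aspect of the real algorithm, and `balanced_assign` is a hypothetical name, not the actual Java change):

   ```python
   def balanced_assign(segments, servers, replicas):
       """Fixed greedy sketch: before placing each segment, sort the candidate
       servers by how many segments they already hold, so the least-loaded
       servers are picked first."""
       assignment = {s: [] for s in servers}
       for seg in segments:
           # Least-loaded first; ties keep the default order (Python's sort is stable).
           candidates = sorted(servers, key=lambda s: len(assignment[s]))
           for server in candidates[:replicas]:
               assignment[server].append(seg)
       return assignment

   segments = [f"seg_{i}" for i in range(1, 7)]
   servers = ["Server 1", "Server 2", "Server 3"]
   result = balanced_assign(segments, servers, 2)
   print({s: len(v) for s, v in result.items()})
   # → {'Server 1': 4, 'Server 2': 4, 'Server 3': 4}
   ```

   Because the candidate order is recomputed per segment, no server can fall behind by more than one segment, so the final distribution is the even 4/4/4.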
   
   ## Test 
   Added a unit test covering the scenario above. The original algorithm would have failed it.

