RexXiong commented on code in PR #2629:
URL: https://github.com/apache/celeborn/pull/2629#discussion_r1697987581


##########
common/src/main/java/org/apache/celeborn/common/client/MasterClient.java:
##########
@@ -226,14 +225,23 @@ private void resetRpcEndpointRef(@Nullable RpcEndpointRef oldRef) {
    *     cannot be obtained.
    * @return non-empty RpcEndpointRef.
    */
-  private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex) {
+  private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex, int currentAttempt) {
     RpcEndpointRef endpointRef = rpcEndpointRef.get();
+
+    List<String> activeMasterEndpoints = masterEndpointResolver.getActiveMasterEndpoints();
+    // If endpoints are updated by MasterEndpointResolver, we should reset the currentIndex to 0.
+    // This also unset the value of updated, so we don't always reset currentIndex to 0.
+    if (masterEndpointResolver.getUpdatedAndReset()) {
+      currentIndex.set(0);
+      maxRetries = Math.max(maxRetries, currentAttempt + activeMasterEndpoints.size());

Review Comment:
   IMO, it would be better to keep maxRetries unchanged. If we raise maxRetries, the larger value persists, so subsequent retries will run with the increased limit, which changes the original behavior. For example:
   Time 1: maxRetries=3
   Time 2: retry when currentAttempt=2, activeMasterEndpoints=3, then maxRetries=5
   Time 3: retry when currentAttempt=4, activeMasterEndpoints=3, then maxRetries=7
   Time 4: retry when currentAttempt=6, activeMasterEndpoints=3, then maxRetries=9
   ....
    
   If the resolver keeps changing, maxRetries can grow significantly, leading to an excessive number of retry attempts in subsequent iterations.
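   The timeline above can be reproduced with a small standalone sketch. It only mirrors the update rule from the diff (`Math.max(maxRetries, currentAttempt + activeMasterEndpoints.size())`); the class name, the `simulate` helper, and the list of attempts on which the resolver reports an update are assumptions made for illustration, not code from the PR.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Celeborn.
class MaxRetriesGrowthDemo {
  // Same update rule as in the diff, applied to plain ints.
  static int nextMaxRetries(int maxRetries, int currentAttempt, int activeEndpoints) {
    return Math.max(maxRetries, currentAttempt + activeEndpoints);
  }

  // Assumed scenario: the resolver reports an update on each attempt in updateAttempts.
  // Returns the value of maxRetries after each of those updates.
  static int[] simulate(int initialMaxRetries, int activeEndpoints, int[] updateAttempts) {
    int[] history = new int[updateAttempts.length];
    int maxRetries = initialMaxRetries;
    for (int i = 0; i < updateAttempts.length; i++) {
      maxRetries = nextMaxRetries(maxRetries, updateAttempts[i], activeEndpoints);
      history[i] = maxRetries;
    }
    return history;
  }

  public static void main(String[] args) {
    // maxRetries starts at 3; updates arrive at currentAttempt = 2, 4, 6 with 3 endpoints.
    System.out.println(Arrays.toString(simulate(3, 3, new int[] {2, 4, 6}))); // [5, 7, 9]
  }
}
```

   Because the limit is stored in a field, each update starts from the already-raised value, so the limit always stays ahead of the attempt counter.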
   
   Also, for each endpoint update, we should never exceed maxRetries. I think we can use a temporary variable in the while loop and check whether the endpoints were updated before calling `getOrSetupRpcEndpointRef`; then we can decide whether to increase the temporary variable.
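   One possible reading of this suggestion is sketched below. It is a simplified model, not the actual `MasterClient` code: the `Resolver` interface is a hypothetical stand-in exposing only the two resolver methods visible in the diff, `attemptsWhenAllFail` and `attemptLimit` are made-up names, and every attempt is assumed to fail so that the loop runs to its limit. The point is only that the configured `maxRetries` stays untouched while a per-call local variable absorbs the extension.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch class, not part of Celeborn.
class LocalRetryBudgetSketch {
  // Stand-in for MasterEndpointResolver; only the methods seen in the diff.
  interface Resolver {
    List<String> getActiveMasterEndpoints();
    boolean getUpdatedAndReset();
  }

  private final int maxRetries; // configured limit, never mutated
  private final Resolver resolver;

  LocalRetryBudgetSketch(int maxRetries, Resolver resolver) {
    this.maxRetries = maxRetries;
    this.resolver = resolver;
  }

  // Counts the attempts made by one call, assuming every attempt fails.
  int attemptsWhenAllFail() {
    AtomicInteger currentIndex = new AtomicInteger(0);
    int attemptLimit = maxRetries; // temporary variable scoped to this call
    int currentAttempt = 0;
    while (currentAttempt < attemptLimit) {
      // Check for updated endpoints before resolving the endpoint ref.
      if (resolver.getUpdatedAndReset()) {
        currentIndex.set(0);
        int endpoints = resolver.getActiveMasterEndpoints().size();
        attemptLimit = Math.max(attemptLimit, currentAttempt + endpoints);
      }
      // Real code would call getOrSetupRpcEndpointRef(currentIndex) and send the request here.
      currentIndex.incrementAndGet();
      currentAttempt++;
    }
    return currentAttempt;
  }

  public static void main(String[] args) {
    final int[] checks = {0};
    Resolver resolver = new Resolver() {
      public List<String> getActiveMasterEndpoints() {
        return Arrays.asList("master-1", "master-2", "master-3");
      }
      public boolean getUpdatedAndReset() {
        return ++checks[0] == 3; // report an update only on the third check
      }
    };
    LocalRetryBudgetSketch client = new LocalRetryBudgetSketch(3, resolver);
    System.out.println(client.attemptsWhenAllFail()); // 5: limit extended within this call
    System.out.println(client.attemptsWhenAllFail()); // 3: next call starts from maxRetries again
  }
}
```

   Whether the local limit should be extended by the number of endpoints, by `maxRetries`, or capped in some other way is a design choice the comment leaves open; the sketch reuses the formula from the diff.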
   


