s0nskar commented on code in PR #2629:
URL: https://github.com/apache/celeborn/pull/2629#discussion_r1696738331


##########
common/src/main/java/org/apache/celeborn/common/client/MasterClient.java:
##########
@@ -226,14 +225,23 @@ private void resetRpcEndpointRef(@Nullable RpcEndpointRef oldRef) {
    *     cannot be obtained.
    * @return non-empty RpcEndpointRef.
    */
-  private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex) {
+  private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex, int currentAttempt) {
     RpcEndpointRef endpointRef = rpcEndpointRef.get();
+
+    List<String> activeMasterEndpoints = masterEndpointResolver.getActiveMasterEndpoints();
+    // If endpoints are updated by MasterEndpointResolver, we should reset the currentIndex to 0.
+    // This also unset the value of updated, so we don't always reset currentIndex to 0.
+    if (masterEndpointResolver.getUpdatedAndReset()) {
+      currentIndex.set(0);
+      maxRetries = Math.max(maxRetries, currentAttempt + activeMasterEndpoints.size());

Review Comment:
   Currently `maxRetries` is set to `maxRetries = max(masterEndpoints.size(), 
conf.masterClientMaxRetries())`, which means the client wants to try every 
available master endpoint at least once.
   
   Say we have only one attempt remaining to connect to the master and we get 
a fresh list of master endpoints from the resolver. IMO we should try each of 
those at least once, to keep the behaviour almost the same. That's why I made 
this change.
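
To make the arithmetic concrete, here is a rough standalone sketch of the two formulas, not the actual `MasterClient` code. The class name, helper names, host names, and the configured retry count of 15 are all made up for illustration, and it assumes `currentAttempt` is zero-based and that the retry loop runs while `currentAttempt < maxRetries`.

```java
import java.util.List;

// Hypothetical illustration only; none of these names exist in Celeborn.
final class RetryBudgetSketch {

  // Initial budget as described above: max(number of endpoints, configured retries).
  static int initialMaxRetries(int endpointCount, int configuredMaxRetries) {
    return Math.max(endpointCount, configuredMaxRetries);
  }

  // Budget after the resolver reports a refreshed endpoint list:
  // keep the old budget, or extend it so that every refreshed endpoint
  // can still be tried once starting from the current attempt.
  static int extendedMaxRetries(int maxRetries, int currentAttempt, int refreshedEndpointCount) {
    return Math.max(maxRetries, currentAttempt + refreshedEndpointCount);
  }

  public static void main(String[] args) {
    // Made-up endpoints and a made-up configured retry count.
    List<String> initial = List.of("master-1:9097", "master-2:9097", "master-3:9097");
    int maxRetries = initialMaxRetries(initial.size(), 15);

    // Assume we are on the last attempt (zero-based) when the list is refreshed.
    int currentAttempt = 14;
    List<String> refreshed = List.of("master-4:9097", "master-5:9097", "master-6:9097");

    int before = maxRetries - currentAttempt;
    maxRetries = extendedMaxRetries(maxRetries, currentAttempt, refreshed.size());
    int after = maxRetries - currentAttempt;

    System.out.println("attempts remaining before refresh: " + before);
    System.out.println("attempts remaining after refresh:  " + after);
  }
}
```

Under those assumptions, a refresh early in the retry loop leaves the budget unchanged, because the remaining attempts already cover the new list; the budget only grows when fewer attempts remain than there are refreshed endpoints.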
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
