s0nskar commented on code in PR #2629:
URL: https://github.com/apache/celeborn/pull/2629#discussion_r1696738331
##########
common/src/main/java/org/apache/celeborn/common/client/MasterClient.java:
##########
@@ -226,14 +225,23 @@ private void resetRpcEndpointRef(@Nullable RpcEndpointRef oldRef) {
* cannot be obtained.
* @return non-empty RpcEndpointRef.
*/
- private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex) {
+ private RpcEndpointRef getOrSetupRpcEndpointRef(AtomicInteger currentIndex, int currentAttempt) {
RpcEndpointRef endpointRef = rpcEndpointRef.get();
+
+ List<String> activeMasterEndpoints = masterEndpointResolver.getActiveMasterEndpoints();
+ // If endpoints are updated by MasterEndpointResolver, we should reset the currentIndex to 0.
+ // This also unset the value of updated, so we don't always reset currentIndex to 0.
+ if (masterEndpointResolver.getUpdatedAndReset()) {
+ currentIndex.set(0);
+ maxRetries = Math.max(maxRetries, currentAttempt + activeMasterEndpoints.size());
Review Comment:
Currently `maxRetries` is set to `max(masterEndpoints.size(),
conf.masterClientMaxRetries())`, which means the client wants to try every
available master endpoint at least once.
Say we have only one attempt remaining to connect to the master and we get a
fresh list of master endpoints from the resolver. IMO we should try all of
those at least once, to keep the behaviour almost the same. That's why I made
this change.
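
To make the arithmetic concrete, here is a minimal standalone sketch of the two formulas mentioned above. It is not the MasterClient code: the class and method names are hypothetical, and it assumes `currentAttempt` counts the attempts already used when the refreshed endpoint list arrives.

```java
// Hypothetical illustration only; names below are not taken from MasterClient.
class RetryBudgetSketch {

    // Initial budget: at least one attempt per known master endpoint.
    static int initialMaxRetries(int endpointCount, int configuredMaxRetries) {
        return Math.max(endpointCount, configuredMaxRetries);
    }

    // After the resolver hands out a fresh list, grow the budget (never shrink it)
    // so that every refreshed endpoint can still be tried once.
    // Assumption: currentAttempt is the number of attempts already consumed.
    static int extendedMaxRetries(int maxRetries, int currentAttempt, int refreshedEndpointCount) {
        return Math.max(maxRetries, currentAttempt + refreshedEndpointCount);
    }

    public static void main(String[] args) {
        int maxRetries = initialMaxRetries(3, 3);   // three endpoints, configured limit of three
        int currentAttempt = 2;                     // two attempts used, one remaining
        int refreshedEndpoints = 3;                 // resolver returns three fresh endpoints

        int extended = extendedMaxRetries(maxRetries, currentAttempt, refreshedEndpoints);
        System.out.println("before=" + maxRetries + " after=" + extended
            + " remaining=" + (extended - currentAttempt));
    }
}
```

With these numbers the budget grows from 3 to 5, leaving three attempts for the three refreshed endpoints. When plenty of attempts remain (for example `currentAttempt = 0`), `Math.max` leaves `maxRetries` unchanged.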