r7raul1984 commented on code in PR #3517:
URL: https://github.com/apache/celeborn/pull/3517#discussion_r2476483235


##########
client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java:
##########
@@ -479,6 +481,7 @@ private PartitionReader createReaderWithRetry(
                 fetchChunkMaxRetry,
                 location,
                 e);
+            location = location.getPeer();

Review Comment:
   After reviewing the code more carefully, I think the issue might not be in 
those areas. Maybe the problem is as follows:
   
   In CelebornShuffleReader, when building makeOpenStreamList, suppose:
   
   workerRequestMap uses hostAndFetchPort as the key.
   There are two workers(hostAndFetchPort): h1:p1 and h1:p2.
   
   Before down scale by kubectl scale:
   
   shuffleClient.getDataClientFactory().createClient(h1, p1) succeeds.
   So in CelebornShuffleReader A, workerRequestMap contains an entry for h1:p1.
   
   After down scale by kubectl scale:
   
   The worker h1:p1 is removed.
   In another CelebornShuffleReader B, trying to create a client for h1:p1 
fails.
   As a result, h1:p1 is added to the excluded worker list.
   
   However, in CelebornShuffleReader A, workerRequestMap still contains h1:p1, 
so it keeps retrying the removed worker, causing unnecessary retries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to