Re: [PR] [CELEBORN-2166] Fastfail reduce stage if shuffle data is lost because of worker lost [celeborn]

via GitHub Sat, 18 Oct 2025 01:28:30 -0700


s0nskar commented on code in PR #3496:
URL: https://github.com/apache/celeborn/pull/3496#discussion_r2419194664



##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -136,10 +136,24 @@ class ReducePartitionCommitHandler(
     if (mockShuffleLost) {
       mockShuffleLostShuffle == shuffleId
     } else {
-      dataLostShuffleSet.contains(shuffleId)
+      dataLostShuffleSet.contains(shuffleId) || isStageDataLostInUnknownWorker(shuffleId)
     }
   }
 
+  private def isStageDataLostInUnknownWorker(shuffleId: Int): Boolean = {
+    if (conf.clientShuffleDataLostOnUnknownWorkerEnabled && !conf.clientPushReplicateEnabled) {

Review Comment:
   I can extend this to support `clientPushReplicateEnabled` as well, in case
both the primary and the replica end up on unknown workers?
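
A minimal sketch of what such a replication-aware check could look like,
assuming a simplified model that is not Celeborn's actual API: the
`PartitionLocation` case class, the `DataLostCheck` object, and the
`unknownWorkers` set below are hypothetical names used only for illustration.
The idea is that a partition counts as lost only when every copy of it sits on
an unknown worker.

```scala
// Hypothetical, simplified model for illustration; not Celeborn's real classes.
final case class PartitionLocation(primaryWorker: String, replicaWorker: Option[String])

object DataLostCheck {
  // A partition is unrecoverable only if every copy lives on an unknown worker.
  def isPartitionLost(loc: PartitionLocation, unknownWorkers: Set[String]): Boolean = {
    val primaryLost = unknownWorkers.contains(loc.primaryWorker)
    // Without replication replicaWorker is None, forall returns true,
    // and the result depends on the primary alone.
    val replicaLost = loc.replicaWorker.forall(unknownWorkers.contains)
    primaryLost && replicaLost
  }

  // The stage's shuffle data is lost as soon as one partition is unrecoverable.
  def isStageDataLost(locations: Seq[PartitionLocation], unknownWorkers: Set[String]): Boolean =
    locations.exists(isPartitionLost(_, unknownWorkers))
}
```

Under this model the existing `!conf.clientPushReplicateEnabled` guard would
become unnecessary, since the no-replica case falls out of the same predicate.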



##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -593,7 +593,7 @@ class LifecycleManager(val appUniqueId: String, val conf: CelebornConf) extends
                 e)
               connectFailedWorkers.put(
                 workerInfo,
-                (StatusCode.WORKER_UNKNOWN, System.currentTimeMillis()))
+                (StatusCode.WORKER_UNRESPONSIVE, System.currentTimeMillis()))

Review Comment:
   IMO using `WORKER_UNKNOWN` here is conflicting. `WorkerStatusTracker` marks
workers as unknown based on the heartbeat response field `res.unknownWorkers`,
which means the worker is not registered in the master cluster.
`LifecycleManager` should not be able to assign this status on its own; it
should instead consult the master by including this worker in
`org.apache.celeborn.client.WorkerStatusTracker#getNeedCheckedWorkers`.
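
A rough sketch of the separation described above, under stated assumptions:
`StatusTrackerSketch`, its methods, and the local `WorkerStatus` enumeration
are hypothetical stand-ins, not the real `WorkerStatusTracker` or `StatusCode`.
The client records a connection failure only as `WORKER_UNRESPONSIVE`, reports
such workers for checking, and moves a worker to `WORKER_UNKNOWN` only when the
master's heartbeat response lists it.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the two status codes discussed in this comment.
object WorkerStatus extends Enumeration {
  val WORKER_UNRESPONSIVE, WORKER_UNKNOWN = Value
}

// Hypothetical tracker; worker identity is reduced to a String for brevity.
class StatusTrackerSketch {
  private val excluded = mutable.Map.empty[String, (WorkerStatus.Value, Long)]

  // Client-side observation: the worker could not be reached.
  def onConnectFailure(worker: String, now: Long): Unit =
    excluded.put(worker, (WorkerStatus.WORKER_UNRESPONSIVE, now))

  // Workers the client asks the master to verify on the next heartbeat.
  def needCheckedWorkers: Set[String] =
    excluded.collect { case (w, (WorkerStatus.WORKER_UNRESPONSIVE, _)) => w }.toSet

  // Only the master's answer may mark a worker as unknown.
  def onHeartbeatResponse(unknownWorkers: Set[String], now: Long): Unit =
    unknownWorkers.foreach(w => excluded.put(w, (WorkerStatus.WORKER_UNKNOWN, now)))

  def statusOf(worker: String): Option[WorkerStatus.Value] =
    excluded.get(worker).map(_._1)
}
```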



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

