s0nskar commented on code in PR #3496:
URL: https://github.com/apache/celeborn/pull/3496#discussion_r2419194664
##########
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala:
##########
@@ -136,10 +136,24 @@ class ReducePartitionCommitHandler(
if (mockShuffleLost) {
mockShuffleLostShuffle == shuffleId
} else {
- dataLostShuffleSet.contains(shuffleId)
+ dataLostShuffleSet.contains(shuffleId) || isStageDataLostInUnknownWorker(shuffleId)
}
}
+ private def isStageDataLostInUnknownWorker(shuffleId: Int): Boolean = {
+ if (conf.clientShuffleDataLostOnUnknownWorkerEnabled && !conf.clientPushReplicateEnabled) {
Review Comment:
I can extend this to support `clientPushReplicateEnabled` as well, in case
both the primary and the replica end up on unknown workers?
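A minimal sketch of how such an extension could look, under stated assumptions: `PartitionWorkers`, `UnknownWorkerCheck`, and the plain `String` worker ids are hypothetical stand-ins, not Celeborn's actual types. The idea is that with replication enabled, a partition counts as lost only when every copy sits on an unknown worker.

```scala
// Hypothetical stand-in for a partition's locations; not Celeborn's real types.
final case class PartitionWorkers(primary: String, replica: Option[String])

object UnknownWorkerCheck {
  // Without replication: the partition is lost if its primary worker is unknown.
  // With replication: lost only if the primary AND the replica (when one exists)
  // are both on unknown workers. A missing replica offers no extra protection.
  def isPartitionLost(
      p: PartitionWorkers,
      unknownWorkers: Set[String],
      replicateEnabled: Boolean): Boolean = {
    val primaryLost = unknownWorkers.contains(p.primary)
    if (!replicateEnabled) primaryLost
    else primaryLost && p.replica.forall(unknownWorkers.contains)
  }

  // The stage's data is considered lost if any single partition is lost.
  def isStageDataLost(
      partitions: Seq[PartitionWorkers],
      unknownWorkers: Set[String],
      replicateEnabled: Boolean): Boolean =
    partitions.exists(isPartitionLost(_, unknownWorkers, replicateEnabled))
}
```

In this sketch a replica on a healthy worker keeps the partition readable, so the stage is not flagged as lost until both copies are gone.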
##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -593,7 +593,7 @@ class LifecycleManager(val appUniqueId: String, val conf: CelebornConf) extends
e)
connectFailedWorkers.put(
workerInfo,
- (StatusCode.WORKER_UNKNOWN, System.currentTimeMillis()))
+ (StatusCode.WORKER_UNRESPONSIVE, System.currentTimeMillis()))
Review Comment:
IMO `WORKER_UNKNOWN` is conflicting here. `WorkerStatusTracker` marks
workers as unknown based on the heartbeat response `res.unknownWorkers`, which
means the worker is not registered in the master cluster. LifecycleManager
should not make that decision on its own; it should consult the master by
including this worker in
`org.apache.celeborn.client.WorkerStatusTracker#getNeedCheckedWorkers`
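
The separation argued for here can be sketched roughly as follows. Everything in the block (`WorkerState`, `StatusSketch`, and its methods) is a hypothetical illustration, not Celeborn's API: a connection failure seen by the client records only "unresponsive", and a worker becomes "unknown" only when the master's heartbeat response says so.

```scala
import scala.collection.mutable

// Hypothetical statuses; names are illustrative, not Celeborn's StatusCode values.
sealed trait WorkerState
case object Unresponsive extends WorkerState // the client failed to connect
case object Unknown extends WorkerState      // the master says the worker is not registered

final class StatusSketch {
  private val states = mutable.Map.empty[String, WorkerState]

  // Client-side connection failure: record only what the client observed.
  // An existing state (including Unknown) is left untouched.
  def onConnectFailure(worker: String): Unit =
    if (!states.contains(worker)) states.update(worker, Unresponsive)

  // Workers the client would ask the master to check in the next heartbeat.
  def needCheck: Set[String] =
    states.collect { case (w, Unresponsive) => w }.toSet

  // Master heartbeat response: the only place a worker becomes Unknown.
  def onHeartbeat(unknownWorkers: Set[String]): Unit =
    unknownWorkers.foreach(w => states.update(w, Unknown))

  def stateOf(worker: String): Option[WorkerState] = states.get(worker)
}
```

With this split, code that reacts to unknown workers only fires on master-confirmed state, never on a transient connection error.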