mridulm commented on code in PR #2639:
URL: https://github.com/apache/celeborn/pull/2639#discussion_r1694588299


##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -913,6 +928,54 @@ class LifecycleManager(val appUniqueId: String, val conf: 
CelebornConf) extends
     context.reply(pbReportShuffleFetchFailureResponse)
   }
 
+  private def handleReportBarrierStageAttemptFailure(
+      context: RpcCallContext,
+      appShuffleId: Int,
+      appShuffleIdentifier: String): Unit = {
+
+    val shuffleIds = shuffleIdMapping.get(appShuffleId)
+    if (shuffleIds == null) {
+      throw new UnsupportedOperationException(s"unexpected! unknown 
appShuffleId $appShuffleId")
+    }
+    var ret = true
+    shuffleIds.synchronized {
+      shuffleIds.get(appShuffleIdentifier) match {
+        case Some((shuffleId, true)) =>
+          ret = invokeAppShuffleTrackerCallback(appShuffleId)

Review Comment:
   ~~From a general flow You are right, now that we are only supporting this 
when fetch failure is `throwsFetchFailure` enabled, we dont need it : it was 
necessary in the earlier iteration when `throwsFetchFailure` could be `false` 
as well.~~
   
   Thinking more, we need to do when before the stage attempt gets submitted, 
not after.
   When spark does it, this gets handled properly - and so the shuffle epoch 
that gets propagated to tasks will be after the unregister has completed.
   
   When Celeborn invokes `PbGetShuffleId`, that is after the tasks have already 
started - and so results in updating the epoch post task start.
   
   This will result in being a much more involved change - and while there 
might be a narrow patch to ensuring it still works; this is subject to 
arbitrary change, and it would be better to rely on more well defined behaviors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to