mridulm commented on code in PR #2639:
URL: https://github.com/apache/celeborn/pull/2639#discussion_r1694588299
##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -913,6 +928,54 @@ class LifecycleManager(val appUniqueId: String, val conf:
CelebornConf) extends
context.reply(pbReportShuffleFetchFailureResponse)
}
+ private def handleReportBarrierStageAttemptFailure(
+ context: RpcCallContext,
+ appShuffleId: Int,
+ appShuffleIdentifier: String): Unit = {
+
+ val shuffleIds = shuffleIdMapping.get(appShuffleId)
+ if (shuffleIds == null) {
+ throw new UnsupportedOperationException(s"unexpected! unknown
appShuffleId $appShuffleId")
+ }
+ var ret = true
+ shuffleIds.synchronized {
+ shuffleIds.get(appShuffleIdentifier) match {
+ case Some((shuffleId, true)) =>
+ ret = invokeAppShuffleTrackerCallback(appShuffleId)
Review Comment:
~~From a general flow You are right, now that we are only supporting this
when fetch failure is `throwsFetchFailure` enabled, we dont need it : it was
necessary in the earlier iteration when `throwsFetchFailure` could be `false`
as well.~~
Thinking more, we need to do it before the stage attempt gets submitted, not
after.
When spark does it, the shuffle epoch that gets propagated to tasks will be
after the unregister has completed.
When Celeborn invokes `PbGetShuffleId`, which is when we could clear map
status, that will be after the tasks have already started - and so results in
updating the epoch post task start : resulting in mismatch.
The impact of epoch update is more involved and this would also be subject
to arbitrary change as spark evolves.
Given this, it would be better to rely on more well defined behaviors.
Note: I had prototyped removing this - when I realized the epoch issue !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]