mridulm commented on code in PR #2609:
URL: https://github.com/apache/celeborn/pull/2609#discussion_r1674668595


##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -136,6 +136,11 @@ public static int celebornShuffleId(
     }
   }
 
+  public static int getMapAttemptNumber(TaskContext context) {
+    assert (context.stageAttemptNumber() < (1 << 15) && context.attemptNumber() < (1 << 16));
+    return (context.stageAttemptNumber() << 16) | context.attemptNumber();
+  }
+

Review Comment:
   Thanks for the test, that is really useful @jiang13021 !
   So the issue here seems to be that we are not handling barrier mode 
correctly: we need something similar to `PbReportShuffleFetchFailure` for 
the case where a barrier taskset fails and is being reattempted.
   
   Whether the stage is determinate or indeterminate does not matter: a 
barrier stage is always reexecuted in its entirety, so we should throw away 
its entire output when the stage is reattempted.
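
For reference, the bit layout used by `getMapAttemptNumber` in the hunk above can be illustrated with a small standalone sketch. The class and method names here (`MapAttemptPacking`, `pack`, `stageAttempt`, `taskAttempt`) are hypothetical and are not part of Celeborn or Spark; only the layout (stage attempt in the high bits, task attempt in the low 16 bits) is taken from the diff. It does not cover the barrier-mode handling discussed in this comment.

```java
// Hypothetical illustration of the packing shown in the diff; not Celeborn code.
final class MapAttemptPacking {
  private MapAttemptPacking() {}

  // Packs the stage attempt into the high bits and the task attempt into the low 16 bits.
  // Uses the same bounds as the assert in the diff: stage < 2^15, task < 2^16.
  static int pack(int stageAttemptNumber, int attemptNumber) {
    if (stageAttemptNumber < 0 || stageAttemptNumber >= (1 << 15)) {
      throw new IllegalArgumentException("stage attempt out of range: " + stageAttemptNumber);
    }
    if (attemptNumber < 0 || attemptNumber >= (1 << 16)) {
      throw new IllegalArgumentException("task attempt out of range: " + attemptNumber);
    }
    return (stageAttemptNumber << 16) | attemptNumber;
  }

  // Recovers the stage attempt from a packed value.
  static int stageAttempt(int packed) {
    return packed >>> 16;
  }

  // Recovers the task attempt from a packed value.
  static int taskAttempt(int packed) {
    return packed & 0xFFFF;
  }
}
```

With this layout, the same task attempt number in two different stage attempts maps to two different packed values, for example `pack(0, 0)` and `pack(1, 0)`. The sketch uses an explicit range check instead of `assert` only so that it behaves the same whether or not the JVM runs with `-ea`.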


