mridulm commented on code in PR #2609:
URL: https://github.com/apache/celeborn/pull/2609#discussion_r1673015879
##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -136,6 +136,11 @@ public static int celebornShuffleId(
}
}
+ public static int getMapAttemptNumber(TaskContext context) {
+ assert (context.stageAttemptNumber() < (1 << 15)
+     && context.attemptNumber() < (1 << 16));
+ return (context.stageAttemptNumber() << 16) | context.attemptNumber();
+ }
+
Review Comment:
Do we have a reproducible test that surfaces the issue? That is, a test which
currently fails but is expected to pass? We can explore what the fix should be
once we have a reproducible test.
The test in this PR only checks that
`spark.stage.maxConsecutiveAttempts` is below 32k, and I am not sure what
the general behavior expected in the test for #2611 is.
To add: when `throwsFetchFailure` is enabled, a different Celeborn shuffle id
is used for each stage attempt (all of which map to the same Spark shuffle
id). The source of confusion could be the variable name `shuffleId` - within
Celeborn it need not match Spark's shuffle id.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]