waitinfuture commented on code in PR #2609:
URL: https://github.com/apache/celeborn/pull/2609#discussion_r1676769675
##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -136,6 +136,11 @@ public static int celebornShuffleId(
}
}
+ public static int getMapAttemptNumber(TaskContext context) {
+    assert (context.stageAttemptNumber() < (1 << 15) && context.attemptNumber() < (1 << 16));
+ return (context.stageAttemptNumber() << 16) | context.attemptNumber();
+ }
+
Review Comment:
> Resubmit stage and reuse celeborn shuffle id
> There are at least two valid MapperEnds with the same shuffleId, mapId, and taskAttemptNumber
I think the second condition is not strict enough. The Celeborn client does not
guarantee that two task attempts generate the same sequence of data batches
(i.e., that the two sequences have the same number of batches and that each
corresponding pair of batches contains the same data). Currently Celeborn uses
(mapId, attemptId) to distinguish between task attempts, where
`mapId = taskContext.partitionId();` and `attemptId = context.attemptNumber()`.
When a determinate stage rerun happens, the rerun task has the same mapId and
attemptNumber (for example, 0) as the previous failed attempt; if the previous
attempt has already pushed some data, this may produce an incorrect result.
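To illustrate why packing the stage attempt into the attempt id avoids this collision, here is a minimal standalone sketch of the packing scheme from the diff (the class name `AttemptIdPacking` and the `pack` helper are hypothetical; the real method takes a Spark `TaskContext`):

```java
// Hypothetical sketch of the bit-packing in getMapAttemptNumber: the stage
// attempt goes into the high 16 bits and the task attempt into the low 16
// bits, so a determinate stage rerun yields a distinct attempt id even when
// the task-level attemptNumber restarts at 0.
public class AttemptIdPacking {

    // Mirrors the PR's packing; the assert guards the 15/16-bit bounds.
    static int pack(int stageAttemptNumber, int attemptNumber) {
        assert stageAttemptNumber < (1 << 15) && attemptNumber < (1 << 16);
        return (stageAttemptNumber << 16) | attemptNumber;
    }

    public static void main(String[] args) {
        // First run of the stage: stage attempt 0, task attempt 0.
        int first = pack(0, 0);
        // Determinate stage rerun: task attemptNumber is 0 again, but the
        // stage attempt has advanced to 1.
        int rerun = pack(1, 0);
        // With attemptNumber alone both would be 0; packed, they differ.
        System.out.println(first + " " + rerun); // 0 65536
    }
}
```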
cc @jiang13021 @mridulm @RexXiong
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]