waitinfuture commented on code in PR #2609:
URL: https://github.com/apache/celeborn/pull/2609#discussion_r1676769675


##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -136,6 +136,11 @@ public static int celebornShuffleId(
     }
   }
 
+  public static int getMapAttemptNumber(TaskContext context) {
+    assert (context.stageAttemptNumber() < (1 << 15) && context.attemptNumber() < (1 << 16));
+    return (context.stageAttemptNumber() << 16) | context.attemptNumber();
+  }
+

Review Comment:
   > Resubmit stage and reuse celeborn shuffle id
   > There are at least two valid MapperEnds with the same shuffleId, mapId, and taskAttemptNumber
   
   I think the second condition is too loose. The Celeborn client does not guarantee that two task attempts generate the same sequence of data batches (i.e. that the two sequences have the same number of batches and that each corresponding pair of batches contains the same data). Currently Celeborn uses (mapId, attemptId) to distinguish between task attempts, where `mapId = taskContext.partitionId();` and `attemptId = context.attemptNumber()`. When a determinate stage rerun happens, the rerun task has the same mapId and attemptNumber (for example 0) as the previous failed attempt; if the previous attempt already pushed some data, this may produce an incorrect result.
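   For illustration, here is a minimal, self-contained sketch (hypothetical class and method names, not part of this PR) of the encoding in the diff above: the stage attempt goes into the high 16 bits, so a rerun task with the same attemptNumber (e.g. 0) gets a distinct combined value from the earlier attempt:

   ```java
   public class AttemptIdCodec {
       // Pack the stage attempt (high 16 bits; the assert limits it to 15 bits,
       // keeping the result non-negative) and the task attempt (low 16 bits)
       // into a single int, mirroring getMapAttemptNumber in the diff.
       public static int encode(int stageAttemptNumber, int attemptNumber) {
           assert stageAttemptNumber < (1 << 15) && attemptNumber < (1 << 16);
           return (stageAttemptNumber << 16) | attemptNumber;
       }

       // Recover the stage attempt from the packed value.
       public static int stageAttempt(int encoded) {
           return encoded >>> 16;
       }

       // Recover the task attempt from the packed value.
       public static int taskAttempt(int encoded) {
           return encoded & 0xFFFF;
       }

       public static void main(String[] args) {
           // A stage rerun (stageAttemptNumber = 1) whose task has
           // attemptNumber = 0 no longer collides with attempt 0 of the
           // original stage run.
           int first = encode(0, 0);
           int rerun = encode(1, 0);
           System.out.println(first != rerun);      // true
           System.out.println(stageAttempt(rerun)); // 1
           System.out.println(taskAttempt(rerun));  // 0
       }
   }
   ```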
   
   cc @jiang13021 @mridulm @RexXiong 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
