RexXiong commented on code in PR #2609:
URL: https://github.com/apache/celeborn/pull/2609#discussion_r1697821776


##########
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java:
##########
@@ -136,6 +136,11 @@ public static int celebornShuffleId(
     }
   }
 
+  public static int getMapAttemptNumber(TaskContext context) {
+    assert (context.stageAttemptNumber() < (1 << 15) && context.attemptNumber() < (1 << 16));
+    return (context.stageAttemptNumber() << 16) | context.attemptNumber();
+  }
+

Review Comment:
   > Thanks @RexXiong for the explanation. Please push your test branch and post
   > the link here so that we can review and test it. Hi @mridulm, @RexXiong and I
   > have discussed this problem for a while, and @RexXiong has reproduced the
   > correctness issue for non-barrier stages with a minor code change. He will post
   > the code soon; I hope this helps make our points clearer.
   
   You can check out the [modified code](https://github.com/RexXiong/celeborn/tree/reproduce_deterministic_stage_problem),
   then change the value of
   `spark.celeborn.test.client.push.mockRandomPushForStageRerun` and check the result of the UT
   `test("celeborn spark integration test - Fetch Failure - Deterministic Stage")`.
   
   When the map attempt number is changed to use the encoding that includes
   `stageAttemptNumber`, the result is correct.
   
   >   public static int getMapAttemptNumber(TaskContext context) {
   >    return context.attemptNumber();
   >    //    assert (context.stageAttemptNumber() < (1 << 15) && context.attemptNumber() < (1 << 16));
   >    //    return (context.stageAttemptNumber() << 16) | context.attemptNumber();
   >  }
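   
   For anyone who wants to see what the commented-out encoding yields without running Spark, here is a minimal standalone sketch. The class `MapAttemptEncoding` and its helper names are hypothetical and are not part of Celeborn. It takes plain ints instead of a `TaskContext`, and it uses an explicit range check in place of `assert`, since Java assertions are disabled unless the JVM runs with `-ea`.

   ```java
   // Illustrative sketch only: class and method names are hypothetical, not Celeborn APIs.
   final class MapAttemptEncoding {
     // The low 16 bits hold the task attempt number; the bits above hold the stage attempt number.
     private static final int ATTEMPT_BITS = 16;
     private static final int ATTEMPT_MASK = (1 << ATTEMPT_BITS) - 1;

     // Packs both attempt numbers into one non-negative int, using the same
     // bounds as the assert in the diff (stage < 2^15, task attempt < 2^16).
     static int encode(int stageAttemptNumber, int attemptNumber) {
       if (stageAttemptNumber < 0 || stageAttemptNumber >= (1 << 15)
           || attemptNumber < 0 || attemptNumber >= (1 << 16)) {
         throw new IllegalArgumentException(
             "attempt numbers out of range: stage=" + stageAttemptNumber
                 + ", task=" + attemptNumber);
       }
       return (stageAttemptNumber << ATTEMPT_BITS) | attemptNumber;
     }

     // Recovers the stage attempt number from an encoded value.
     static int stageAttemptOf(int encoded) {
       return encoded >>> ATTEMPT_BITS;
     }

     // Recovers the task attempt number from an encoded value.
     static int taskAttemptOf(int encoded) {
       return encoded & ATTEMPT_MASK;
     }

     public static void main(String[] args) {
       // Task attempt 0 in the original stage attempt and in a stage rerun:
       // the plain attemptNumber is 0 in both cases, but the encoded values differ.
       int original = encode(0, 0);
       int rerun = encode(1, 0);
       System.out.println(original + " vs " + rerun); // prints "0 vs 65536"
     }
   }
   ```

   With this packing, the same task attempt number in two different stage attempts maps to two different map attempt numbers, which is the property the encoding in the diff relies on.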
   
   cc @jiang13021 @mridulm 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
