venkata91 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r657503558



##########
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -156,26 +157,31 @@ private AppShufflePartitionInfo 
getOrCreateAppShufflePartitionInfo(
   @VisibleForTesting
   AppShufflePartitionInfo newAppShufflePartitionInfo(
       AppShuffleId appShuffleId,
+      int shuffleSequenceId,
       int reduceId,
       File dataFile,
       File indexFile,
       File metaFile) throws IOException {
-    return new AppShufflePartitionInfo(appShuffleId, reduceId, dataFile,
+    return new AppShufflePartitionInfo(appShuffleId, shuffleSequenceId, 
reduceId, dataFile,
       new MergeShuffleFile(indexFile), new MergeShuffleFile(metaFile));
   }
 
   @Override
-  public MergedBlockMeta getMergedBlockMeta(String appId, int shuffleId, int 
reduceId) {
+  public MergedBlockMeta getMergedBlockMeta(
+      String appId,
+      int shuffleId,
+      int shuffleSequenceId,
+      int reduceId) {
     AppShuffleId appShuffleId = new AppShuffleId(appId, shuffleId);
-    File indexFile = getMergedShuffleIndexFile(appShuffleId, reduceId);
+    File indexFile = getMergedShuffleIndexFile(appShuffleId, 
shuffleSequenceId, reduceId);

Review comment:
       After having offline discussion with @otterc and @mridulm, looks like in 
the case of a stage re-execution for an indeterminate stage, all the active 
jobs which has a chain to this mapStage are rolled back and resubmitted again.
   Therefore, we shouldn't hit a case where the old `shuffleSequenceId` data is 
read for a shuffle. So in terms of correctness, either sending the 
`shuffleSequenceId` through fetch protocol or keeping the mapping from 
`<shuffle-id -> <shuffleSequenceId` on the shuffle service side tracking the 
latest finalized attempt - both should work.
   
   I prefer passing the `shuffleSequenceId` through the `fetch` protocol would 
keep it simpler but @otterc prefers to not to change the `fetch` protocol as it 
is not clear when would old `shuffleSequenceId` can be fetched (for that does 
renaming this variable or adding a comment would help?).
   Looking for suggestions if others have any other perspectives, once we agree 
on one of the solution will go ahead and make that change. cc @Victsm @otterc 
@mridulm @Ngone51 @tgravescs 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to