venkata91 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r657503558
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -156,26 +157,31 @@ private AppShufflePartitionInfo
getOrCreateAppShufflePartitionInfo(
@VisibleForTesting
AppShufflePartitionInfo newAppShufflePartitionInfo(
AppShuffleId appShuffleId,
+ int shuffleSequenceId,
int reduceId,
File dataFile,
File indexFile,
File metaFile) throws IOException {
- return new AppShufflePartitionInfo(appShuffleId, reduceId, dataFile,
+ return new AppShufflePartitionInfo(appShuffleId, shuffleSequenceId,
reduceId, dataFile,
new MergeShuffleFile(indexFile), new MergeShuffleFile(metaFile));
}
@Override
- public MergedBlockMeta getMergedBlockMeta(String appId, int shuffleId, int
reduceId) {
+ public MergedBlockMeta getMergedBlockMeta(
+ String appId,
+ int shuffleId,
+ int shuffleSequenceId,
+ int reduceId) {
AppShuffleId appShuffleId = new AppShuffleId(appId, shuffleId);
- File indexFile = getMergedShuffleIndexFile(appShuffleId, reduceId);
+ File indexFile = getMergedShuffleIndexFile(appShuffleId,
shuffleSequenceId, reduceId);
Review comment:
After having offline discussion with @otterc and @mridulm, looks like in
the case of a stage re-execution for an indeterminate stage, then all the
active jobs which has a chain to this mapStage are rolled back and resubmitted
again.
Therefore, we shouldn't hit a case where the old `shuffleSequenceId` data is
read for a shuffle. So in terms of correctness, either sending the
`shuffleSequenceId` through fetch protocol or keeping the mapping from
`<shuffle-id -> <shuffleSequenceId` tracking the latest finalized attempt both
should work.
I prefer passing the `shuffleSequenceId` through the `fetch` protocol would
keep it simpler but @otterc prefers to not to change the `fetch` protocol as it
is not clear when would old `shuffleSequenceId` can be fetched (for that does
renaming this variable or adding a comment would help?).
Looking for suggestions if others have any other perspectives, once we agree
on one of the solution will go ahead and make that change. cc @Victsm @otterc
@mridulm @Ngone51 @tgravescs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]