gaoyajun02 commented on code in PR #56559:
URL: https://github.com/apache/spark/pull/56559#discussion_r3433858778


##########
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##########
@@ -105,6 +105,28 @@ private class ShuffleStatus(
    */
   private[spark] val checksumMismatchIndices: Set[Int] = Set()
 
+  /**
+   * Set of stale mapIds for this shuffle. When task retry or speculation causes multiple
+   * attempts for the same map output to push, the merger may include data from a stale attempt.
+   * We record the stale mapIds here so the reduce side can check chunkBitmaps and fallback
+   * if stale data is present in a merged block.
+   */
+  private[this] val staleMapIds = new java.util.HashSet[Int]()
+
+  /**
+   * Mark a map output as having stale (redundant) push attempts. Called from TaskSetManager when it
+   * detects that multiple task attempts for the same map output pushed data to the merger.
+   * @param staleMapId the mapId of the stale (redundant) attempt
+   */
+  def markStalePushedMap(staleMapId: Int): Unit = withWriteLock {

Review Comment:
   **Agreed rename plan:**
   
   The codebase already uses `staleMapIndexes` / `markStalePushedPartition` / `getStaleMapIndexes` in most places. The remaining inconsistencies are:
   
   1. **`MapOutputTrackerWorker`**: `staleMapIds` → should be `staleMapIndexes` (for consistency with `ShuffleStatus` and `PushBasedFetchHelper`)
   2. **`TaskSetManager` log line 967**: `mapId=${mapStatus.mapId}` → should be `partitionId=$partitionId` (log what we actually mark, not the MapStatus field)
   3. **Test names / comments**: any remaining `*MapId*` references to the stale set should use `*PartitionId*` or `*MapIndex*`
   
   We'll clean these up to make the invariant obvious and prevent future "corrections" that break the bitmap match. Thanks for catching this.
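
To illustrate why the naming matters, here is a minimal sketch of the invariant, assuming the chunk bitmap is keyed by map index (partition ID). It is not the PR's code: `StaleMapIndexSketch`, `chunkHasStaleData`, and the sample values are hypothetical, and `java.util.BitSet` stands in for whatever bitmap type the merged-block metadata actually uses.

```scala
import java.util.BitSet

// Hypothetical illustration only; names and values are assumptions, not Spark APIs.
object StaleMapIndexSketch {

  /** Returns true if any map index recorded in the chunk bitmap is marked stale. */
  def chunkHasStaleData(chunkBitmap: BitSet, staleMapIndexes: Set[Int]): Boolean =
    staleMapIndexes.exists(i => chunkBitmap.get(i))

  def main(args: Array[String]): Unit = {
    // Assume this merged chunk contains data from map indexes 0, 2 and 5.
    val chunkBitmap = new BitSet()
    Seq(0, 2, 5).foreach(i => chunkBitmap.set(i))

    // Stale set keyed by map index (partition ID): lines up with the bitmap.
    val staleMapIndexes = Set(2)
    // Stale set mistakenly keyed by a MapStatus-style mapId (an arbitrary sample value):
    // it never lines up with the bitmap, so the stale data goes undetected.
    val staleMapIds = Set(1042)

    println(chunkHasStaleData(chunkBitmap, staleMapIndexes)) // true
    println(chunkHasStaleData(chunkBitmap, staleMapIds))     // false
  }
}
```

In this sketch the second lookup silently returns `false`, which is the kind of mismatch that naming the set by index rather than by `mapId` is meant to make harder to reintroduce.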



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

