Ngone51 commented on PR #52606: URL: https://github.com/apache/spark/pull/52606#issuecomment-3489408460
The problem is that the failing mllib tests, for example, use `Dataset.rdd` to run Spark jobs, and the execution of `Dataset.rdd` is also wrapped in `SQLExecution.withNewExecutionId`. So it also triggers shuffle cleanup once the RDD execution is done, even though the shuffle data is still needed by the reduce RDD. The error occurs when the reduce RDD starts running, and it behaves differently before and after this PR.

Before this PR: `MapOutputTrackerMaster.unregisterShuffle(shuffleId)` is called, which clears the single-source-of-truth shuffle statuses. This leads to a rerun of the map RDD because the shuffle statuses are missing. Since the rerun is triggered within the reduce RDD's execution, the map RDD's shuffle statuses are not cleaned up this time, so the reduce RDD is able to run successfully. We'd then go through the same cycle for later reduce RDDs.

After this PR: `MapOutputTrackerMaster.clearShuffleStatusCache(shuffleId)` is called instead, and it is a no-op. But the shuffle index/data files are still removed by the subsequent `shuffleManager.unregisterShuffle(shuffleId)` (which is also invoked before this PR). Since the shuffle statuses are not removed, the reduce RDD starts to run but fails to find the shuffle index/data files, which results in shuffle fetch failures.

The above is the issue in local mode. In non-local mode, the difference is that we use `MapOutputTrackerWorker` instead. For `MapOutputTrackerWorker`, the shuffle cleanup behavior is the same before and after this PR: both clear the cached shuffle statuses on the executor. So, theoretically, the job would also hit the shuffle fetch failures in non-local mode.

To fix the issue, I think we should not apply shuffle cleanup when the execution trigger is RDD. WDYT? @cloud-fan @karuppayya
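To make the local-mode sequence easier to follow, here is a toy Python model of it. This is only a sketch and not Spark code: `ToyDriver`, `simulate`, and the flags `clear_statuses_on_cleanup` and `skip_cleanup_for_rdd` are hypothetical names invented for illustration, and the model assumes a single shuffle and a single reduce job. It only mirrors the steps described above (statuses cleared vs. kept, files removed, rerun inside the reduce job).

```python
class ToyDriver:
    """Toy stand-in for driver-side shuffle state in local mode (not Spark code)."""

    def __init__(self, clear_statuses_on_cleanup):
        # True models the pre-PR cleanup (unregisterShuffle on the master);
        # False models the post-PR cleanup (clearShuffleStatusCache, a no-op).
        self.clear_statuses_on_cleanup = clear_statuses_on_cleanup
        self.statuses = set()  # shuffle ids with registered shuffle statuses
        self.files = set()     # shuffle ids whose index/data files exist
        self.map_runs = 0      # how many times the map stage has run

    def run_map_stage(self, shuffle_id):
        self.map_runs += 1
        self.statuses.add(shuffle_id)
        self.files.add(shuffle_id)

    def cleanup_shuffle(self, shuffle_id):
        if self.clear_statuses_on_cleanup:
            self.statuses.discard(shuffle_id)
        # In both cases the index/data files are removed
        # (models shuffleManager.unregisterShuffle).
        self.files.discard(shuffle_id)

    def run_reduce_stage(self, shuffle_id):
        if shuffle_id not in self.statuses:
            # Missing statuses: the map stage is rerun inside the reduce job,
            # and no cleanup happens after this rerun.
            self.run_map_stage(shuffle_id)
        if shuffle_id not in self.files:
            return "fetch failure"
        return "success"


def simulate(clear_statuses_on_cleanup, skip_cleanup_for_rdd=False):
    driver = ToyDriver(clear_statuses_on_cleanup)
    driver.run_map_stage(0)  # models the Dataset.rdd execution
    if not skip_cleanup_for_rdd:
        # Models the cleanup triggered when the wrapped execution ends.
        driver.cleanup_shuffle(0)
    outcome = driver.run_reduce_stage(0)
    return outcome, driver.map_runs


print(simulate(clear_statuses_on_cleanup=True))   # before this PR
print(simulate(clear_statuses_on_cleanup=False))  # after this PR
print(simulate(clear_statuses_on_cleanup=False, skip_cleanup_for_rdd=True))  # proposed fix
```

In this model the pre-PR path succeeds only because the map stage runs twice, the post-PR path fails on the missing files, and skipping cleanup for RDD-triggered executions succeeds with a single map run.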
