weixiuli edited a comment on issue #26206: [SPARK-29551][CORE] Fix a bug about fetch failed when an executor is …
URL: https://github.com/apache/spark/pull/26206#issuecomment-545234149

When an executor is lost for some reason (e.g. the external shuffle service or the host itself goes down) and a reduce stage hits a `fetch failed` from that executor at about the same time, the previous code only calls `mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)`, which marks a single map output as broken in the map stage. The other map outputs on that executor are unavailable too, but they can only be resubmitted through nested stage retries, which is the regression.

When a reduce stage gets a fetch failure and the executor is still active, the previous code does call `mapOutputTracker.removeOutputsOnHost(host)` or `mapOutputTracker.removeOutputsOnExecutor(execId)`, but it does NOT do so in the situation described above. So we should distinguish the failedEpoch of 'executor lost' from the fetchFailedEpoch of 'fetch failed' to solve this problem. @cloud-fan
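To illustrate the idea, here is a minimal, self-contained sketch of why a shared epoch map hides the cleanup and a separate one does not. It is a toy model, not the actual `DAGScheduler` or `MapOutputTracker` code: `ToyMapOutputTracker`, `ToyScheduler`, `EpochDemo` and the `separateEpochs` flag are hypothetical names made up for this example, and it assumes that losing an executor does not by itself remove its shuffle files (as with an external shuffle service).

```scala
import scala.collection.mutable

// Toy stand-in for the tracker: shuffleId -> (mapIndex -> execId). Not Spark's real class.
final class ToyMapOutputTracker {
  private val outputs = mutable.Map.empty[Int, mutable.Map[Int, String]]

  def register(shuffleId: Int, mapIndex: Int, execId: String): Unit =
    outputs.getOrElseUpdate(shuffleId, mutable.Map.empty[Int, String]).update(mapIndex, execId)

  // Drops a single map output.
  def unregisterMapOutput(shuffleId: Int, mapIndex: Int): Unit =
    outputs.get(shuffleId).foreach(_.remove(mapIndex))

  // Drops every map output that was written by the given executor.
  def removeOutputsOnExecutor(execId: String): Unit =
    outputs.values.foreach { m =>
      val dead = m.collect { case (idx, e) if e == execId => idx }.toList
      dead.foreach(idx => m.remove(idx))
    }

  def available(shuffleId: Int): Set[Int] =
    outputs.get(shuffleId).map(_.keySet.toSet).getOrElse(Set.empty[Int])
}

// Toy scheduler. `separateEpochs` (hypothetical flag) switches between one shared
// epoch map and the two maps proposed in the comment above.
final class ToyScheduler(tracker: ToyMapOutputTracker, separateEpochs: Boolean) {
  private val failedEpoch = mutable.Map.empty[String, Long]      // 'executor lost'
  private val fetchFailedEpoch = mutable.Map.empty[String, Long] // 'fetch failed'

  // Assumption: shuffle files may outlive the executor, so nothing is removed here.
  def executorLost(execId: String, epoch: Long): Unit =
    if (failedEpoch.get(execId).forall(_ < epoch)) failedEpoch(execId) = epoch

  def fetchFailed(shuffleId: Int, mapIndex: Int, execId: String, epoch: Long): Unit = {
    tracker.unregisterMapOutput(shuffleId, mapIndex)
    val epochs = if (separateEpochs) fetchFailedEpoch else failedEpoch
    // With the shared map, an earlier 'executor lost' at the same epoch makes this
    // check fail, so the remaining outputs on the executor are never removed.
    if (epochs.get(execId).forall(_ < epoch)) {
      epochs(execId) = epoch
      tracker.removeOutputsOnExecutor(execId)
    }
  }
}

object EpochDemo {
  // Returns the map indices of shuffle 0 still considered available.
  def remaining(separateEpochs: Boolean): Set[Int] = {
    val tracker = new ToyMapOutputTracker
    Seq(0, 1, 2).foreach(i => tracker.register(0, i, "exec-1"))
    tracker.register(0, 3, "exec-2")
    val scheduler = new ToyScheduler(tracker, separateEpochs)
    scheduler.executorLost("exec-1", epoch = 5L)
    scheduler.fetchFailed(0, 0, "exec-1", epoch = 5L)
    tracker.available(0)
  }
}
```

In this toy model, `EpochDemo.remaining(separateEpochs = false)` still reports maps 1 and 2 on the lost executor as available, so each would need its own fetch failure and stage retry, while `EpochDemo.remaining(separateEpochs = true)` leaves only map 3 on `exec-2`.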