weixiuli commented on issue #26206: [SPARK-29551][CORE] Fix a bug about fetch failed when an executor is …
URL: https://github.com/apache/spark/pull/26206#issuecomment-545234149

When an executor is lost for some reason (e.g. the external shuffle service or the host that the executor runs on is lost), and the loss happens to coincide with a reduce stage reporting `fetch failed` from that executor, the result is really bad. In that situation the previous code only calls `mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)` to mark a single map output as broken in the map stage. The other map outputs on that executor are also unavailable, and they can only be resubmitted by a nested retry stage, which is the regression.

So we should distinguish the failedEpoch of 'executor lost' from the fetchFailedEpoch of 'fetch failed' to solve the above problem.

As we all know, the previous code does call `mapOutputTracker.removeOutputsOnHost(host)` or `mapOutputTracker.removeOutputsOnExecutor(execId)` when a reduce stage fetch fails and the executor is still active, but it does NOT do so in the case described above. @cloud-fan
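To make the difference concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not Spark's `DAGScheduler` or `MapOutputTracker`: the classes `ToyMapOutputTracker`, `ToyScheduler` and `EpochDemo`, the `separateEpochs` flag, and the fixed `currentEpoch` are all hypothetical names introduced for illustration. It also assumes one plausible reading of the problem, namely that the executor-lost event records an epoch without removing map outputs, so that a later fetch failure sees the executor as already handled.

```scala
import scala.collection.mutable

// Toy stand-in for a map output tracker (hypothetical, not Spark's class).
final class ToyMapOutputTracker {
  // mapIndex -> id of the executor holding that map output
  private val outputs = mutable.Map[Int, String]()

  def register(mapIndex: Int, execId: String): Unit = outputs(mapIndex) = execId

  def unregisterMapOutput(mapIndex: Int): Unit = outputs -= mapIndex

  def removeOutputsOnExecutor(execId: String): Unit = {
    // Collect first, then remove, to avoid mutating while iterating.
    val lost = outputs.collect { case (idx, e) if e == execId => idx }.toList
    lost.foreach(outputs -= _)
  }

  def available: Set[Int] = outputs.keySet.toSet
}

// Toy scheduler (hypothetical). With separateEpochs = false, both events
// share one epoch map; with true, fetch failures get their own map.
final class ToyScheduler(tracker: ToyMapOutputTracker, separateEpochs: Boolean) {
  private val failedEpoch = mutable.Map[String, Long]()
  private val fetchFailedEpoch = mutable.Map[String, Long]()
  private val currentEpoch = 0L // kept fixed to keep the sketch small

  // Assumption: the executor is lost but its outputs are not removed here.
  def executorLost(execId: String): Unit =
    if (failedEpoch.get(execId).forall(_ < currentEpoch)) {
      failedEpoch(execId) = currentEpoch
    }

  def fetchFailed(mapIndex: Int, execId: String): Unit = {
    // The single failing map output is always unregistered.
    tracker.unregisterMapOutput(mapIndex)
    val epochs = if (separateEpochs) fetchFailedEpoch else failedEpoch
    // Remove the rest only if this executor is not already marked at this epoch.
    if (epochs.get(execId).forall(_ < currentEpoch)) {
      epochs(execId) = currentEpoch
      tracker.removeOutputsOnExecutor(execId)
    }
  }
}

object EpochDemo {
  // Returns the map outputs still registered after: executor lost, then fetch failed.
  def run(separateEpochs: Boolean): Set[Int] = {
    val tracker = new ToyMapOutputTracker
    tracker.register(0, "exec-1")
    tracker.register(1, "exec-1")
    tracker.register(2, "exec-2")
    val scheduler = new ToyScheduler(tracker, separateEpochs)
    scheduler.executorLost("exec-1")
    scheduler.fetchFailed(0, "exec-1")
    tracker.available
  }
}
```

In this toy model, `EpochDemo.run(separateEpochs = false)` leaves map output 1 registered on the lost executor, so a later fetch would fail again and need another retry. `EpochDemo.run(separateEpochs = true)` removes every output on `exec-1` in one step.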
