weixiuli edited a comment on issue #26206: [SPARK-29551][CORE] Fix a bug about 
fetch failed when an executor is …
URL: https://github.com/apache/spark/pull/26206#issuecomment-545234149
 
 
   When an executor is lost for some reason and something else fails with it
(e.g. the external shuffle service on the executor's host, or the host itself,
is lost), the loss can coincide with a reduce stage `fetch failed` from that
executor, which is really bad. In that case the previous code only calls
`mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)` to mark
one map output as broken in the map stage. The other map outputs on the
executor are unavailable too, but they can only be resubmitted by a nested
retry stage, which is the regression.
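
To illustrate the difference between dropping one map output and dropping
every output on an executor, here is a minimal toy model. `ToyMapOutputTracker`
and `outputsOn` are hypothetical names made up for this sketch; it only borrows
the method names `unregisterMapOutput` and `removeOutputsOnExecutor` and is not
Spark's actual `MapOutputTracker`.

```scala
import scala.collection.mutable

// Toy model only (an assumption for illustration, not Spark's MapOutputTracker):
// remembers which executor holds the output of each map task.
final class ToyMapOutputTracker {
  private val outputs = mutable.Map.empty[Int, String] // mapIndex -> execId

  def register(mapIndex: Int, execId: String): Unit =
    outputs.update(mapIndex, execId)

  // Drops a single map output, and only if it still lives on the given executor.
  def unregisterMapOutput(mapIndex: Int, execId: String): Unit =
    if (outputs.get(mapIndex).contains(execId)) outputs.remove(mapIndex)

  // Drops every map output registered on the given executor.
  def removeOutputsOnExecutor(execId: String): Unit = {
    val stale = outputs.collect { case (idx, e) if e == execId => idx }.toList
    outputs --= stale
  }

  // Map indices whose output is still registered on the given executor.
  def outputsOn(execId: String): Set[Int] =
    outputs.collect { case (idx, e) if e == execId => idx }.toSet
}
```

In this model, after registering maps 0 and 1 on `exec-1`, calling
`unregisterMapOutput(0, "exec-1")` still leaves map 1 registered on the lost
executor, so a reducer would hit another fetch failure before map 1 is rerun.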
   
   As we all know, the previous code calls
`mapOutputTracker.removeOutputsOnHost(host)` or
`mapOutputTracker.removeOutputsOnExecutor(execId)` when a reduce stage fetch
fails and the executor is still active, but it does NOT do so in the case
described above.
   
   So we should distinguish the `failedEpoch` of 'executor lost' from the
`fetchFailedEpoch` of 'fetch failed' to solve the above problem.
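
A rough sketch of the idea, assuming the problem is that one shared epoch
record lets an earlier 'executor lost' event suppress the cleanup for a later
'fetch failed' event. `ToyEpochTracker`, `onExecutorLost`, `onFetchFailed` and
`advance` are hypothetical names for this sketch and not the actual
`DAGScheduler` code; only `failedEpoch` and `fetchFailedEpoch` come from the
comment above.

```scala
import scala.collection.mutable

// Hypothetical sketch, not the actual DAGScheduler: keeps one epoch map per
// event type so that the two kinds of events do not suppress each other.
final class ToyEpochTracker {
  // Latest epoch at which each executor was reported lost.
  private val failedEpoch = mutable.Map.empty[String, Long]
  // Latest epoch at which each executor was handled for a fetch failure.
  private val fetchFailedEpoch = mutable.Map.empty[String, Long]

  // Returns true if this 'executor lost' event is newer than the recorded one.
  def onExecutorLost(execId: String, epoch: Long): Boolean =
    advance(failedEpoch, execId, epoch)

  // Returns true if the executor's map outputs should be removed now.
  def onFetchFailed(execId: String, epoch: Long): Boolean =
    advance(fetchFailedEpoch, execId, epoch)

  // Records the epoch and returns true only if it is newer than what is stored.
  private def advance(m: mutable.Map[String, Long], execId: String, epoch: Long): Boolean =
    if (m.get(execId).forall(_ < epoch)) { m(execId) = epoch; true } else false
}
```

With separate maps, a 'fetch failed' event that arrives in the same epoch as an
'executor lost' event for the same executor still triggers removal of that
executor's outputs, while a duplicate 'fetch failed' in that epoch is ignored.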
   
   @cloud-fan 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
