weixiuli commented on issue #26206: [SPARK-29551][CORE] Fix a bug about fetch 
failed when an executor is …
URL: https://github.com/apache/spark/pull/26206#issuecomment-545234149
 
 
   When an executor is lost for some reason (e.g. the external shuffle service 
or the host is lost) and the loss happens to coincide with a reduce stage 
`fetch failed` from that executor, the result is really bad: the previous code 
only calls `mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, 
bmAddress)` to mark one map output as broken in the map stage. The other map 
outputs on that executor are also unavailable, and they can only be recomputed 
by a nested retry stage, which is the regression.
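
   To make the difference concrete, here is a minimal sketch. It is a toy 
stand-in, not Spark's `MapOutputTracker`: the class `ToyMapOutputTracker`, the 
`register` and `outputsOn` helpers, and the use of a plain executor id string 
in place of `bmAddress` are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Toy stand-in for a map output tracker; NOT Spark's MapOutputTracker.
// It records which executor holds each (shuffleId, mapIndex) output.
final class ToyMapOutputTracker {
  private val locations = mutable.Map.empty[(Int, Int), String]

  def register(shuffleId: Int, mapIndex: Int, execId: String): Unit =
    locations((shuffleId, mapIndex)) = execId

  // Drops a single map output, and only if it is still recorded on `execId`.
  def unregisterMapOutput(shuffleId: Int, mapIndex: Int, execId: String): Unit =
    if (locations.get((shuffleId, mapIndex)).contains(execId)) {
      locations.remove((shuffleId, mapIndex))
    }

  // Drops every map output recorded for `execId`.
  def removeOutputsOnExecutor(execId: String): Unit = {
    val lost = locations.toList.collect { case (key, e) if e == execId => key }
    lost.foreach(key => locations.remove(key))
  }

  // Map indices of `shuffleId` that are still registered on `execId`.
  def outputsOn(shuffleId: Int, execId: String): Set[Int] =
    locations.toList.collect {
      case ((s, m), e) if s == shuffleId && e == execId => m
    }.toSet
}
```

   In this toy, if maps 0 and 1 of a shuffle are registered on the same 
executor, unregistering map 0 alone leaves map 1 recorded there, so a reducer 
would hit a second fetch failure later. Removing all outputs on the executor 
clears both at once.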
   
   So we should distinguish the failedEpoch of 'executor lost' from the 
fetchFailedEpoch of 'fetch failed' to solve the above problem.
   
   As we know, the previous code calls 
`mapOutputTracker.removeOutputsOnHost(host)` or 
`mapOutputTracker.removeOutputsOnExecutor(execId)` when a reduce stage fetch 
fails and the executor is still active, but it does NOT do so in the scenario 
above.
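
   The distinction can be sketched with a toy model. This is not the 
DAGScheduler code: `ToyScheduler`, `isNewer`, `outputsRemovedFor`, and the 
`separateEpochs` switch are hypothetical, and the sketch assumes that each 
event is ignored unless its epoch is newer than the one already recorded for 
that executor.

```scala
import scala.collection.mutable

// Toy model of epoch gating; NOT Spark's DAGScheduler.
// `separateEpochs = false` mimics one shared epoch map for both events.
final class ToyScheduler(separateEpochs: Boolean) {
  private val failedEpoch = mutable.Map.empty[String, Long]      // 'executor lost'
  private val fetchFailedEpoch = mutable.Map.empty[String, Long] // 'fetch failed'

  // Executors whose map outputs have been removed in bulk.
  val outputsRemovedFor = mutable.Set.empty[String]

  // True when no epoch is recorded yet, or the recorded one is older.
  private def isNewer(seen: mutable.Map[String, Long], execId: String, epoch: Long): Boolean =
    seen.get(execId).forall(_ < epoch)

  // Executor lost while its shuffle files are assumed to still be served
  // (for example by an external shuffle service), so no outputs are removed.
  def executorLost(execId: String, epoch: Long): Unit =
    if (isNewer(failedEpoch, execId, epoch)) failedEpoch(execId) = epoch

  // Fetch failure reported against `execId` by a task launched at `epoch`.
  def fetchFailed(execId: String, epoch: Long): Unit = {
    val seen = if (separateEpochs) fetchFailedEpoch else failedEpoch
    if (isNewer(seen, execId, epoch)) {
      seen(execId) = epoch
      outputsRemovedFor += execId // stands in for removeOutputsOnExecutor(execId)
    }
  }
}
```

   With a shared map, an 'executor lost' at some epoch followed by a 'fetch 
failed' at the same epoch is filtered out, so the bulk removal never runs. 
With separate maps, the same sequence still triggers it.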
   
   @cloud-fan 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
