Hi,

There is an existing way to handle this situation: those tasks will become zombie tasks [1] and they should not be counted toward the task failures [2]. The shuffle blocks should also be unregistered for the lost executor, although the map output locations pointing to the lost executor might already be cached on the other executors [3], which can generate new fetch failures.

To investigate this further, check the referenced code parts and run Spark with DEBUG logging enabled for these classes, for example with log4j settings along the lines of the sketch below.
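As an illustration only (assuming Spark's default log4j setup, i.e. copying conf/log4j.properties.template to conf/log4j.properties), you could raise the log level for the classes behind [1]-[3] with something like:

  # DEBUG logging for the scheduler / map-output classes referenced in [1]-[3]
  log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG
  log4j.logger.org.apache.spark.MapOutputTracker=DEBUG
  # optionally also the DAGScheduler, which reacts to the fetch failures
  log4j.logger.org.apache.spark.scheduler.DAGScheduler=DEBUG

The logger names are just the fully qualified class names, so adjust them to whichever classes you end up reading.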
Reading the logs and looking at the code together will help you a lot. Also consider using a recent Spark release, as there have been changes in this area.

Important: you can avoid this problem altogether by using the external shuffle service. If you happen to be on YARN, please check this link [4]. When the external shuffle service is enabled, shuffle blocks are not lost with the dying executor, because they can be served by the shuffle service running on the same host where the executor was (see the sketch after the links below).

Best Regards,
Attila

[1] https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L812
[2] https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L871
[3] https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L664
[4] https://spark.apache.org/docs/2.3.0/running-on-yarn.html#configuring-the-external-shuffle-service
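P.S. In case a concrete starting point helps: based on [4], enabling the external shuffle service on YARN roughly means putting Spark's YARN shuffle jar on every NodeManager's classpath, registering the service in yarn-site.xml, restarting the NodeManagers, and then turning it on from the Spark side. A minimal sketch (exact jar location and versions depend on your cluster, so please follow [4] for the complete steps):

  <!-- yarn-site.xml on every NodeManager: keep any aux-services you already have -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

  # spark-defaults.conf (or pass via --conf on submit)
  spark.shuffle.service.enabled=true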