Hi,

There is an existing way to handle this situation: those tasks become
zombie tasks [1], and their failures should not be counted towards the
task failure limit [2]. The shuffle blocks registered for the lost executor
should also be unregistered, although the map output locations pointing to
the lost executor might already be cached on other executors [3], which can
generate further fetch failures.

Check the mentioned code parts and run Spark with DEBUG logging enabled for
these classes to investigate this further. Reading the logs and the code
together will help you a lot. Also consider using a recent Spark version,
as there have been changes in this area.
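
If it helps, here is a minimal sketch of that logging setup, assuming the
stock log4j 1.x configuration Spark 2.3 ships in conf/log4j.properties
(adjust to whatever logging backend you actually use):

    # enable DEBUG only for the classes referenced above
    log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG
    log4j.logger.org.apache.spark.MapOutputTracker=DEBUG
    log4j.logger.org.apache.spark.MapOutputTrackerMaster=DEBUG

Keeping DEBUG scoped to these loggers keeps the rest of the log readable.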

Important: You can avoid this problem altogether by using the external
shuffle service. 
If you happen to be on YARN, please check this link [4].

When the external shuffle service is enabled, shuffle blocks are not lost
along with a dying executor, because they can still be served by the
shuffle service running on the same host where the executor was.
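
For reference, a minimal sketch of the settings involved on YARN (property
names are the ones from the linked docs [4]; yarn-site.xml is shown here as
key/value pairs for brevity):

    # spark-defaults.conf
    spark.shuffle.service.enabled=true

    # yarn-site.xml on every NodeManager
    yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService

The YARN shuffle service jar also has to be on the NodeManager classpath;
the linked page [4] walks through the full setup.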

Best Regards,
Attila 

[1]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L812

[2]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L871

[3]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L664

[4]
https://spark.apache.org/docs/2.3.0/running-on-yarn.html#configuring-the-external-shuffle-service

 



