Github user attilapiros commented on the issue:
https://github.com/apache/spark/pull/20203
Node blacklisting is covered by the following unit tests:
- HistoryServerSuite
- TaskSetBlacklistSuite
- AppStatusListenerSuite
It was also tested manually on a 2-node cluster:
https://issues.apache.org/jira/secure/attachment/12906833/node_blacklisting_for_stage.png
Here you can see that the node apiros3.gce.test.com was blacklisted for the stage
because of failures on executors 4 and 5. As expected, executor 3 is also
blacklisted even though it has no failures itself, because it shares the node with
executors 4 and 5.
Spark was started as:
``` bash
./bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G \
  --num-executors=8 \
  --conf "spark.blacklist.enabled=true" \
  --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" \
  --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" \
  --conf "spark.blacklist.application.maxFailedTasksPerExecutor=10" \
  --conf "spark.eventLog.enabled=true"
```
And the job was:
``` scala
import org.apache.spark.SparkEnv

sc.parallelize(1 to 10000, 10).map { x =>
  if (SparkEnv.get.executorId.toInt >= 4) throw new RuntimeException("Bad executor")
  else (x % 3, x)
}.reduceByKey((a, b) => a + b).collect()
```
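Besides the screenshot, the blacklist state can also be inspected through the status
REST API. Below is a minimal sketch (run from the same spark-shell session) that dumps
the executors endpoint of the live UI; it assumes the driver UI is on the default port
4040 on localhost, and the exact blacklist-related field names in the JSON (e.g.
`isBlacklisted`) may vary between Spark versions.

``` scala
// Rough sketch: query the live UI's REST API for per-executor status and
// look at the blacklist-related fields in the returned JSON.
import scala.io.Source

val appId = sc.applicationId
val src = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/executors")
try println(src.mkString) finally src.close()
```

Since `spark.eventLog.enabled=true` is set above, the same endpoints are also served by
the history server after the application finishes.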