Aaruna Godthi created SPARK-34109:
-------------------------------------
Summary: Killing executors excluded on failure results in
additional executors being marked as excluded due to fetch failures
Key: SPARK-34109
URL: https://issues.apache.org/jira/browse/SPARK-34109
Project: Spark
Issue Type: Bug
Components: Kubernetes, Shuffle, Spark Core
Affects Versions: 3.0.1, 3.0.0
Reporter: Aaruna Godthi
Configuration:
```
spark.excludeOnFailure.enabled: true # aka deprecated spark.blacklist.enabled
spark.excludeOnFailure.application.fetchFailure.enabled: true # aka deprecated spark.blacklist.application.fetchFailure.enabled
spark.excludeOnFailure.killExcludedExecutors: true # aka deprecated spark.blacklist.killBlacklistedExecutors
```
With this configuration, we have noticed that when a few executors are
excluded due to task failures (possibly caused by host issues), those
executors are killed after being excluded.
However, when other executors then try to fetch shuffle blocks from the
killed executors, the fetches fail, and these other executors also end up
being excluded because
`spark.excludeOnFailure.application.fetchFailure.enabled` is set.
Instead, fetch failures against executors that have already been excluded
should not be counted when excluding executors based on
`spark.excludeOnFailure.application.fetchFailure.enabled`.
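
To illustrate the requested behaviour, here is a minimal Python sketch of the filtering rule. It is a toy model only and not Spark's actual exclusion code: `FetchFailure`, `countable_fetch_failures`, and the executor IDs are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchFailure:
    """Hypothetical event: `fetcher` failed to read shuffle blocks from `source`."""
    fetcher: str
    source: str


def countable_fetch_failures(failures, already_excluded):
    """Keep only failures whose source executor was NOT already excluded.

    A fetch from an executor that was excluded (and then killed) is expected
    to fail, so under the proposed rule such failures are dropped here and
    would not feed into any further exclusion decision.
    """
    return [f for f in failures if f.source not in already_excluded]


# Assumed scenario: exec-1 was excluded for task failures and then killed.
excluded = {"exec-1"}
failures = [
    FetchFailure(fetcher="exec-2", source="exec-1"),  # source was already excluded
    FetchFailure(fetcher="exec-3", source="exec-4"),  # source was not excluded
]

remaining = countable_fetch_failures(failures, excluded)
print(remaining)
```

In this sketch only the failure against `exec-4` is kept. The failure against `exec-1` is ignored because that executor had already been excluded.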