[
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494183#comment-16494183
]
Imran Rashid commented on SPARK-24413:
--------------------------------------
Yeah, I agree with this. I linked two related JIRAs that are very close. I
put down some thoughts earlier on those JIRAs about good ways to do this, but
haven't had time to work on it.
> Executor Blacklisting shouldn't immediately fail the application if dynamic
> allocation is enabled and no active executors
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
> Priority: Major
>
> Currently, with executor blacklisting enabled and dynamic allocation on, if
> you have only 1 active executor (the spark.blacklist.killBlacklistedExecutors
> setting doesn't matter in this case; it can be on or off) and a task failure
> results in that 1 executor getting blacklisted, then your entire
> application will fail. The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one
> task is the long tail, and if it happens to hit a bad node that causes the
> executor to be blacklisted, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately
> fail but should let dynamic allocation try to get more executors.
>
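The requested change in behavior can be illustrated with a minimal sketch.
This is not Spark's scheduler code: SchedulerState and decide are
hypothetical names invented for illustration, and the model covers only a
single pending task. It contrasts the current outcome (abort as soon as no
usable executor remains) with the proposed one (wait for dynamic allocation
to bring up more executors).

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical, simplified view of what the scheduler knows."""
    active_executors: set        # IDs of executors currently registered
    blacklisted_executors: set   # IDs blacklisted for the pending task
    dynamic_allocation: bool     # mirrors spark.dynamicAllocation.enabled


def decide(state: SchedulerState) -> str:
    """Return 'run', 'wait', or 'abort' for a single pending task."""
    usable = state.active_executors - state.blacklisted_executors
    if usable:
        # At least one executor is not blacklisted, so the task can run.
        return "run"
    if state.dynamic_allocation:
        # Proposed behavior: do not fail immediately; give dynamic
        # allocation a chance to acquire new executors.
        return "wait"
    # No usable executor and no way to obtain more: abort the task set.
    return "abort"


# The scenario from the issue: one active executor, and it is blacklisted.
print(decide(SchedulerState({"1"}, {"1"}, dynamic_allocation=True)))
print(decide(SchedulerState({"1"}, {"1"}, dynamic_allocation=False)))
```

A real implementation would presumably also need some bound on how long to
wait; that detail is an assumption and is left out of the sketch.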