We tried enabling blacklisting for some customers and in cloud deployments, and
very quickly they ended up with 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud
deployments, and shouldn't be on by default. The heart of the issue is that the
current implementation is not good at distinguishing transient errors from
catastrophic errors.

+Chris, who was involved in those tests.
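
To give a sense of the tuning involved (purely illustrative, not a concrete
recommendation; the property names are the ones from Ankur's list below), the
relevant knobs look roughly like this:

  import org.apache.spark.SparkConf

  // Illustrative only: loosen the per-executor thresholds and shorten the
  // timeout so that a couple of transient failures don't take an executor
  // out of the pool for a full hour.
  val tunedConf = new SparkConf()
    .set("spark.blacklist.enabled", "true")
    .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")        // default 1
    .set("spark.blacklist.application.maxFailedTasksPerExecutor", "4")  // default 2
    .set("spark.blacklist.timeout", "10m")                              // default 1h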

On Thu, Mar 28, 2019 at 3:32 PM, Ankur Gupta <ankur.gu...@cloudera.com.invalid> wrote:

> 
> Hi all,
> 
> 
> This is a follow-on to my PR:
> https://github.com/apache/spark/pull/24208, where I aimed to enable
> blacklisting for fetch failures by default. From the comments, there is
> interest in the community in enabling the overall blacklisting feature by
> default. I have listed 3 different things that we can do and would like to
> gather feedback and see if anyone has objections. Otherwise, I will just
> create a PR for the same.
> 
> 
> 1. *Enable blacklisting feature by default*. The blacklisting feature was
> added as part of SPARK-8425 and has been available since 2.2.0. This feature
> was deemed experimental and was disabled by default. The feature blacklists
> an executor/node from running a particular task, any task in a particular
> stage, or all tasks in the application, based on the number of failures.
> There are various configurations available which control those thresholds.
> Additionally, the executor/node is only blacklisted for a configurable time
> period. The idea is to enable the blacklisting feature with the existing
> defaults, which are the following:
> * spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
> * spark.blacklist.task.maxTaskAttemptsPerNode = 2
> * spark.blacklist.stage.maxFailedTasksPerExecutor = 2
> * spark.blacklist.stage.maxFailedExecutorsPerNode = 2
> * spark.blacklist.application.maxFailedTasksPerExecutor = 2
> * spark.blacklist.application.maxFailedExecutorsPerNode = 2
> * spark.blacklist.timeout = 1 hour
> 
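
For anyone who hasn't tried the feature: with these defaults, "enable by
default" amounts to nothing more than flipping the existing master switch,
spark.blacklist.enabled. A minimal sketch:

  import org.apache.spark.SparkConf

  // Minimal sketch: turn the feature on and rely on the defaults listed above.
  val conf = new SparkConf()
    .set("spark.blacklist.enabled", "true")
  // or, equivalently, on the command line:
  //   spark-submit --conf spark.blacklist.enabled=true ...
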
> 2. *Kill blacklisted executors/nodes by default*. This feature was added
> as part of SPARK-16554 and has been available since 2.2.0. It is a follow-on
> to blacklisting: if an executor/node is blacklisted for the application,
> then it is also killed, terminating all of its running tasks, for faster
> failure recovery.
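
Continuing the sketch above, the corresponding switch here is
spark.blacklist.killBlacklistedExecutors (the flag added by SPARK-16554, which
currently defaults to false):

  // Kill an executor once it is blacklisted for the application (and all
  // executors on a node once that node is blacklisted); off by default today.
  conf.set("spark.blacklist.killBlacklistedExecutors", "true")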
> 
> 
> 3. *Remove legacy blacklisting timeout config*:
> spark.scheduler.executorTaskBlacklistTime
> 
> 
> Thanks,
> Ankur
>
