Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/159#discussion_r10647216
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -59,6 +59,15 @@ private[spark] class TaskSetManager(
       // CPUs to request per task
       val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
     
    +  /*
    +   * Sometimes if an executor is dead or in an otherwise invalid state, the driver
    +   * does not realize right away leading to repeated task failures. If enabled,
    +   * this temporarily prevents a task from re-launching on an executor where
    +   * it just failed.
    +   */
    +  private[this] val EXECUTOR_TASK_BLACKLIST_TIMEOUT =
    +    conf.getLong("spark.task.executorBlacklistTimeout", 0L)
    --- End diff --
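
    To make the behavior in the diff comment concrete, here is a minimal, self-contained sketch of a per-task, per-executor blacklist with a timeout. It is not the code from this pull request: the class TaskExecutorBlacklist and its methods are hypothetical names, the timeout is assumed to be in milliseconds, and a value of 0 (the default in the diff) is assumed to mean "disabled".

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: remembers when each task last
// failed on each executor and rejects that pairing until the timeout elapses.
// Assumption: timeoutMs is in milliseconds and 0 disables the check.
final class TaskExecutorBlacklist(timeoutMs: Long) {
  // taskIndex -> (executorId -> time of the most recent failure, in ms)
  private val failedExecutors =
    mutable.HashMap.empty[Int, mutable.HashMap[String, Long]]

  // Record that the given task just failed on the given executor.
  def recordFailure(taskIndex: Int, executorId: String, nowMs: Long): Unit = {
    failedExecutors
      .getOrElseUpdate(taskIndex, mutable.HashMap.empty[String, Long])
      .update(executorId, nowMs)
  }

  // True only while the most recent failure of this task on this executor
  // is younger than the timeout; other tasks and executors are unaffected.
  def isBlacklisted(taskIndex: Int, executorId: String, nowMs: Long): Boolean = {
    if (timeoutMs <= 0L) {
      false
    } else {
      failedExecutors
        .get(taskIndex)
        .flatMap(_.get(executorId))
        .exists(failedAt => nowMs - failedAt < timeoutMs)
    }
  }
}
```

    A scheduler loop could call isBlacklisted before offering a task to an executor and skip the offer while it returns true. Because the check is keyed on the (task, executor) pair, the executor is not excluded for other tasks, which is why this is narrower than general blacklist handling.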
    
    Regarding the timeout variable name: I deliberately went for a very specific name, so that when a better approach to this issue arrives we can retire this variable without potential name conflicts. spark.scheduler.blacklistTimeout implies more general blacklist handling, which this unfortunately is not. IMO, this is a stopgap solution until we add support for a better blacklist approach that handles both executors and blocks.
    
    But until we have that, this will at least unblock us. Thankfully, this is not something a lot of users are hitting, though it is unfortunately fairly common in our case.
    
    
    Given this, should we expose this in the documentation?

