Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17113#discussion_r119151885
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
    @@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker (
         nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
       }
     
    +  private def killBlacklistedExecutor(exec: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing blacklisted executor id $exec " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          a.killExecutors(Seq(exec), true, true)
    +        case None =>
    +          logWarning(s"Not attempting to kill blacklisted executor id 
$exec " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  private def killExecutorsOnBlacklistedNode(node: String): Unit = {
    +    if (conf.get(config.BLACKLIST_KILL_ENABLED)) {
    +      allocationClient match {
    +        case Some(a) =>
    +          logInfo(s"Killing all executors on blacklisted host $node " +
    +            s"since spark.blacklist.killBlacklistedExecutors is set.")
    +          if (!a.killExecutorsOnHost(node)) {
    +            logError(s"Killing executors on node $node failed.")
    +          }
    +        case None =>
    +          logWarning(s"Not attempting to kill executors on blacklisted 
host $node " +
    +            s"since allocation client is not defined.")
    +      }
    +    }
    +  }
    +
    +  def updateBlacklistForFetchFailure(host: String, exec: String): Unit = {
    +    if (BLACKLIST_FETCH_FAILURE_ENABLED) {
    +      // If we blacklist on fetch failures, we are implicitly saying that we believe the failure is
    +      // non-transient, and can't be recovered from (even if this is the first fetch failure).
    --- End diff ---
    
    I would expand this comment to explain why we don't wait for multiple failures -- the stage is retried after just one fetch failure, so we don't always get a chance to collect multiple fetch failures.
    
    But to be honest, I still really wonder about that part.  So what if it takes a couple of stages to get enough fetch failures to blacklist the node?  This seems so drastic that I'm having a tough time imagining a scenario where I'd actually recommend this setting to a user.
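    For the expanded comment, something along these lines (a rough sketch -- the exact wording is mine, not what the PR contains):
    
        // If we blacklist on fetch failures, we are implicitly saying that we believe the
        // failure is non-transient, and can't be recovered from (even if this is the first
        // fetch failure). Note that we do not wait to accumulate multiple fetch failures
        // before blacklisting: a single fetch failure already fails and retries the whole
        // stage, so we don't always get a chance to observe more than one fetch failure
        // for the same host within an attempt.
    
    It's probably also worth spelling out in the config docs that none of this kicks in unless the user explicitly opts in via the fetch-failure blacklisting config (BLACKLIST_FETCH_FAILURE_ENABLED), and that the kill behavior additionally requires spark.blacklist.killBlacklistedExecutors.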

