seregasheypak commented on a change in pull request #23223: 
[SPARK-26269][YARN] YarnAllocator should have same blacklist behaviour with YARN 
to maximize use of cluster resources
URL: https://github.com/apache/spark/pull/23223#discussion_r249580532
 
 

 ##########
 File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ##########
 @@ -612,11 +612,14 @@ private[yarn] class YarnAllocator(
             val message = "Container killed by YARN for exceeding physical 
memory limits. " +
               s"$diag Consider boosting ${EXECUTOR_MEMORY_OVERHEAD.key}."
             (true, message)
+          case exit_status if 
NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(exit_status) =>
+            (true, "Container marked as failed: " + containerId + onHostStr +
 
 Review comment:
   >  concrete case where this really helps and needed.
   
   There is a 1K-node cluster, and jobs suffer performance degradation because 
of a single node. It's rather hard to convince Cluster Ops to decommission a 
node because of "performance degradation". Imagine 10 dev teams chasing a single 
ops team, whether for a valid reason (the node has problems) or because the code 
has a bug, the data is skewed, or there are spots on the sun.
   
   Simple solution:
   - Rerun the failed/delayed job and blacklist the "problematic" node.
   - Report the problem to Ops only if the job then runs without anomalies.
   
   Results:
   - Ops are not spammed with weird requests from devs.
   - Devs are not blocked by a genuinely bad node.
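
   The case clause quoted in the diff can be sketched in isolation as below. 
This is a minimal sketch: the exit codes `-100`/`-101` and the `classify` 
helper are assumptions for illustration, standing in for the real constants in 
Spark's `YarnAllocator` and YARN's `ContainerExitStatus`.

   ```scala
   // Hypothetical sketch of classifying container exit statuses so that
   // node/system faults count towards blacklisting the node, while ordinary
   // application exits do not.
   object ExitStatusSketch {
     // Exit statuses treated as node/system faults rather than application
     // faults (hypothetical values for illustration only).
     val NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS: Set[Int] = Set(-100, -101)

     /** Returns (countTowardsBlacklist, message) for a finished container. */
     def classify(exitStatus: Int, containerId: String, host: String): (Boolean, String) =
       exitStatus match {
         case s if NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(s) =>
           (true, s"Container marked as failed: $containerId on host: $host")
         case _ =>
           (false, s"Container $containerId exited with status $exitStatus")
       }

     def main(args: Array[String]): Unit = {
       println(classify(-100, "container_01", "node-42"))
       println(classify(0, "container_02", "node-42"))
     }
   }
   ```

   The point of the `(Boolean, String)` pair is that the caller can both log 
the message and feed the boolean into the blacklist bookkeeping for that host.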
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
