Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21068#discussion_r185539768
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -602,8 +569,7 @@ private[yarn] class YarnAllocator(
                   completedContainer.getDiagnostics,
                   PMEM_EXCEEDED_PATTERN))
               case _ =>
    -            // Enqueue the timestamp of failed executor
    -            failedExecutorsTimeStamps.enqueue(clock.getTimeMillis())
    +            allocatorBlacklistTracker.handleResourceAllocationFailure(hostOpt)
    --- End diff --
    
    are we actually totally sure this is an allocation failure? It's really any failure which isn't covered by the cases above, right? For example, it could be somebody logging in to the box and running `kill -9` on the executor (people do weird things), sudden hardware failure, etc. I think the handling is still fine, but I'd add a comment here explaining that a bit.
    
    If we *really* want to separate out allocation failures, the handling would need to be a bit more complex: we'd have to keep track of which executors have actually registered with the driver.
    
    @tgravescs sound ok?
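    To make that suggestion concrete, a minimal sketch of the idea (the `RegistrationTracker` class and its method names are hypothetical, illustrative only, and not Spark's actual API): a container that exits without its executor ever registering with the driver most likely failed during allocation/startup, whereas one that registered and then died failed at runtime.

```scala
import scala.collection.mutable

// Hypothetical sketch, not Spark's actual API: distinguish allocation
// failures from runtime failures by tracking which executors have
// registered with the driver.
class RegistrationTracker {
  private val registered = mutable.Set.empty[String]

  // Called when the driver sees an executor register.
  def onExecutorRegistered(executorId: String): Unit =
    registered += executorId

  // An executor that never registered most likely failed during
  // allocation/startup; one that registered failed at runtime
  // (hardware fault, a stray `kill -9`, etc.).
  def isAllocationFailure(executorId: String): Boolean =
    !registered.contains(executorId)
}

object RegistrationTrackerDemo extends App {
  val tracker = new RegistrationTracker
  tracker.onExecutorRegistered("exec-1")
  println(tracker.isAllocationFailure("exec-1")) // false: registered, so runtime failure
  println(tracker.isAllocationFailure("exec-2")) // true: never registered
}
```

    Real handling would also need to cope with late registrations racing the container-completion callback, which is part of why the simpler "add a comment" route may be preferable.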


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
