Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21068
Ah, sorry, I haven't had time to get back to this. Yeah, the driver
interaction could be an issue. But whether it's the limit or just the YARN-side
blacklisting, I think you would need some interaction there, right? Or you
would have to have similar logic on the YARN side that detects when all nodes
are blacklisted and tells the application to fail. Otherwise you could
blacklist the entire cluster based on container launch failures and the
application would be stuck, because the driver blacklist wouldn't know about it.
Personally I'd rather see a limit than the current failure behavior, as I
think it would be more robust. In my opinion I would rather retry a node at
some point and have the job fail on max task failures than not try at all.
I've seen jobs fail when they only have one executor that gets blacklisted but
would have worked fine if retried; the blacklisting logic isn't perfect. We do
have the kill-on-blacklist option, which I haven't used much at this point, but
I guess that would also help here.
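For reference, the kill-on-blacklist behavior mentioned above is the existing
application-level blacklist config; a minimal example of turning it on (the
cluster sizing values are just placeholders):

```scala
import org.apache.spark.SparkConf

// Enable task/stage blacklisting, and additionally ask the cluster manager to
// kill executors (and all executors on a node) once they are blacklisted for
// the whole application.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
```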
I guess for this I'm fine with removing the limit for now, since that matches
the current behavior on the driver side and communicating back to the driver
blacklist could be complicated. We do need to handle the case where all nodes
are blacklisted on the YARN side, though.
I was going to say this could be handled just by making sure
spark.yarn.max.executor.failures is sane, but I don't think that is really the
case now: with dynamic allocation it's based on Int.MaxValue or whatever the
user specifies, which could have nothing to do with the actual cluster size.
And you might have a small cluster where someone wants to try hard and allow it
to fail twice per node, or something like that, if the YARN blacklisting is
off. So do we just need another check that fails the application if all, or a
certain percentage of, the nodes are blacklisted? Did you have something in
mind to replace the limit?
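To make the "fail if all or a certain percent of nodes are blacklisted" idea
concrete, here is a rough sketch of what such a check could look like on the
YARN allocator side. This is only an illustration: the class, method, and
threshold parameter are made up, and a real change would have to plug into the
allocator's existing node-blacklist tracking.

```scala
import org.apache.spark.SparkException

// Hypothetical helper: fail the application once the blacklisted nodes cover
// too much of the cluster, instead of relying on
// spark.yarn.max.executor.failures.
class BlacklistCapacityCheck(
    numClusterNodes: Int,
    maxBlacklistedFraction: Double) { // 1.0 = only fail when ALL nodes are blacklisted

  /**
   * Throws if the blacklisted nodes cover at least the configured fraction of
   * the cluster, so the AM can report failure instead of hanging with no
   * schedulable nodes left.
   */
  def checkOrFail(blacklistedNodes: Set[String]): Unit = {
    if (numClusterNodes > 0 &&
        blacklistedNodes.size >= math.ceil(numClusterNodes * maxBlacklistedFraction)) {
      throw new SparkException(
        s"${blacklistedNodes.size} of $numClusterNodes cluster nodes are " +
          "blacklisted for container allocation; failing the application.")
    }
  }
}
```

With maxBlacklistedFraction = 1.0 this only trips when every node is
blacklisted, which is the minimal safety net discussed above; a lower fraction
would approximate the "fail after a certain percent" variant.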