Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21068
Ah, sorry, I haven't had time to get back to this. Yeah, the driver
interaction could be an issue. But whether it's the limit or just the YARN-side
blacklisting, I think you would need some interaction there, right? Or you
would have to have similar logic on the YARN side that detects when all nodes
are blacklisted and tells the application to fail. Otherwise you could
blacklist the entire cluster based on container launch failures and the
application would be stuck, because the driver blacklist wouldn't know about it.
Personally I'd rather see a limit than the current failure behavior, as I
think it would be more robust. In my opinion I would rather retry a node at
some point and have the job fail on max task failures than not try at all.
I've seen jobs fail when they only have one executor that gets blacklisted but
would have worked fine if retried; the blacklisting logic isn't perfect. We do
have the kill-on-blacklist option, which I haven't used much at this point, but
I guess that would also help here.
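For reference, the kill-on-blacklist behavior mentioned above is the existing
application-level blacklist config; a minimal example of turning it on (the
cluster sizing values are just placeholders):

```scala
import org.apache.spark.SparkConf

// Enable task/stage blacklisting, and additionally ask the cluster manager to
// kill executors (and all executors on a node) once they are blacklisted for
// the whole application.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
```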
I guess for this I'm fine with removing the limit for now, since that matches
the current behavior on the driver side and communicating back to the driver
blacklist could be complicated. We do need to handle the case where all nodes
are blacklisted on the YARN side, though.
I was going to say this could be handled just by making sure
spark.yarn.max.executor.failures is sane, but I don't think that is really the
case now: with dynamic allocation it's based on Int.MaxValue or whatever the
user specifies, which could have nothing to do with the actual cluster size.
And you might have a small cluster where someone wants to try hard and allow it
to fail twice per node, or something like that, if the YARN blacklisting is
off. So do we just need another check that fails the application if all, or a
certain percentage of, the nodes are blacklisted? Did you have something in
mind to replace the limit?
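To make the "fail if all or a certain percent of nodes are blacklisted" idea
concrete, here is a rough sketch of what such a check could look like on the
YARN allocator side. This is only an illustration: the class, method, and
threshold parameter are made up, and a real change would have to plug into the
allocator's existing node-blacklist tracking.

```scala
import org.apache.spark.SparkException

// Hypothetical helper: fail the application once the blacklisted nodes cover
// too much of the cluster, instead of relying on
// spark.yarn.max.executor.failures.
class BlacklistCapacityCheck(
    numClusterNodes: Int,
    maxBlacklistedFraction: Double) { // 1.0 = only fail when ALL nodes are blacklisted

  /**
   * Throws if the blacklisted nodes cover at least the configured fraction of
   * the cluster, so the AM can report failure instead of hanging with no
   * schedulable nodes left.
   */
  def checkOrFail(blacklistedNodes: Set[String]): Unit = {
    if (numClusterNodes > 0 &&
        blacklistedNodes.size >= math.ceil(numClusterNodes * maxBlacklistedFraction)) {
      throw new SparkException(
        s"${blacklistedNodes.size} of $numClusterNodes cluster nodes are " +
          "blacklisted for container allocation; failing the application.")
    }
  }
}
```

With maxBlacklistedFraction = 1.0 this only trips when every node is
blacklisted, which is the minimal safety net discussed above; a lower fraction
would approximate the "fail after a certain percent" variant.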