Hi,

We're running Spark on EMR and are seeing the whole job aborted when a slave goes down.
What happens is that the slave starts throwing IllegalStateException "shutdown in progress" errors (see [1]), but Spark treats these as regular application errors, so it retries the task, which gets scheduled back onto the same node, fails again, and so on 4x, after which the job is aborted.

So, there are two things I'm wondering about:

1) Can Spark recognize these "shutdown in progress" errors and mark the node as offline more gracefully? Should this already be handled?

2) It seems like a task that has failed, for any reason, shouldn't be sent back to the same node, especially 4x in a row. At some point the node should be marked as suspect and the task tried somewhere else. Only if a task has failed on more than one node would we fail the job; and if a node has failed more than one task (that then succeeded on other nodes), we'd just consider the node dead. Does this seem reasonable?

I'm willing to hack around on either; it's been a few months since I've poked around in DAGScheduler, and I saw some pull requests go by. Just wondering if maybe these are already handled now, or if things are in flux and I should hold off.

Thanks!
- Stephen

[1]: https://gist.github.com/stephenh/6342916
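For concreteness, here is a rough sketch of the blacklisting heuristic proposed in 2). All names here are illustrative, not Spark's actual scheduler API; this is just the bookkeeping a scheduler would need: retry a failed task on a different node, fail the job only once a task has failed on more than one node, and mark a node dead once more than one of its failed tasks has succeeded elsewhere.

```java
import java.util.*;

// Hypothetical sketch of the proposed policy (not Spark's DAGScheduler API).
public class BlacklistPolicy {
    // taskId -> nodes the task has failed on
    private final Map<Integer, Set<String>> taskFailures = new HashMap<>();
    // node -> number of tasks that failed there but later succeeded elsewhere
    private final Map<String, Integer> recoveredFailures = new HashMap<>();
    private final Set<String> deadNodes = new HashSet<>();

    public void recordFailure(int taskId, String node) {
        taskFailures.computeIfAbsent(taskId, k -> new HashSet<>()).add(node);
    }

    // Fail the job only once the task has failed on more than one node.
    public boolean shouldFailJob(int taskId) {
        return taskFailures.getOrDefault(taskId, Set.of()).size() > 1;
    }

    // Prefer a node the task has not already failed on, skipping dead nodes.
    public Optional<String> pickNode(int taskId, List<String> nodes) {
        Set<String> failedOn = taskFailures.getOrDefault(taskId, Set.of());
        return nodes.stream()
                .filter(n -> !failedOn.contains(n) && !deadNodes.contains(n))
                .findFirst();
    }

    // Call when a task that previously failed succeeds on another node:
    // each node it failed on gets a strike; >1 strikes marks the node dead.
    public void recordRecovery(int taskId) {
        for (String node : taskFailures.getOrDefault(taskId, Set.of())) {
            int strikes = recoveredFailures.merge(node, 1, Integer::sum);
            if (strikes > 1) deadNodes.add(node);
        }
    }

    public boolean isDead(String node) {
        return deadNodes.contains(node);
    }
}
```

Under this policy, a single "shutdown in progress" failure would push the task to another node rather than burning all 4 retries on the dying slave, and the slave itself would age out after its second recovered failure.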
