Hi,

We're running Spark on EMR and are seeing the whole job aborted when a slave goes down.
What happens is that the slave starts throwing IllegalStateException "shutdown in progress" errors (see [1]), but Spark treats these as regular application errors, so it retries the task, which gets scheduled back onto the same node, fails again, and so on 4x, after which the job is aborted.

So, there are two things I'm wondering about:

1) Can Spark recognize these "shutdown in progress" errors and mark the node as offline more gracefully? Should this already be handled?

2) It seems like a task that has failed, for any reason, shouldn't be sent back to the same node, especially 4x in a row. At some point the node should be marked as suspect and the task tried somewhere else. Only if a task has failed on more than one node would we fail the job; and if a node has failed more than one task (that then succeeded on other nodes), we'd just consider the node dead. Does this seem reasonable?

I'm willing to hack around on either; it's been a few months since I've poked around in DAGScheduler, and I saw some pull requests go by. Just wondering if maybe these are already handled now, or if things are in flux and I should hold off.

Thanks!
- Stephen

[1]: https://gist.github.com/stephenh/6342916
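For concreteness, here is a rough sketch of the blacklisting heuristic proposed in 2). All names here are illustrative, not Spark's actual scheduler API; this is just the bookkeeping a scheduler would need: retry a failed task on a different node, fail the job only once a task has failed on more than one node, and mark a node dead once more than one of its failed tasks has succeeded elsewhere.

```java
import java.util.*;

// Hypothetical sketch of the proposed policy (not Spark's DAGScheduler API).
public class BlacklistPolicy {
    // taskId -> nodes the task has failed on
    private final Map<Integer, Set<String>> taskFailures = new HashMap<>();
    // node -> number of tasks that failed there but later succeeded elsewhere
    private final Map<String, Integer> recoveredFailures = new HashMap<>();
    private final Set<String> deadNodes = new HashSet<>();

    public void recordFailure(int taskId, String node) {
        taskFailures.computeIfAbsent(taskId, k -> new HashSet<>()).add(node);
    }

    // Fail the job only once the task has failed on more than one node.
    public boolean shouldFailJob(int taskId) {
        return taskFailures.getOrDefault(taskId, Set.of()).size() > 1;
    }

    // Prefer a node the task has not already failed on, skipping dead nodes.
    public Optional<String> pickNode(int taskId, List<String> nodes) {
        Set<String> failedOn = taskFailures.getOrDefault(taskId, Set.of());
        return nodes.stream()
                .filter(n -> !failedOn.contains(n) && !deadNodes.contains(n))
                .findFirst();
    }

    // Call when a task that previously failed succeeds on another node:
    // each node it failed on gets a strike; >1 strikes marks the node dead.
    public void recordRecovery(int taskId) {
        for (String node : taskFailures.getOrDefault(taskId, Set.of())) {
            int strikes = recoveredFailures.merge(node, 1, Integer::sum);
            if (strikes > 1) deadNodes.add(node);
        }
    }

    public boolean isDead(String node) {
        return deadNodes.contains(node);
    }
}
```

Under this policy, a single "shutdown in progress" failure would push the task to another node rather than burning all 4 retries on the dying slave, and the slave itself would age out after its second recovered failure.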
