Some new features are about to land in Spark to improve its ability to handle bad executors and nodes. These are significant changes, and we'd like to gather more input from the community about them, especially from folks who use *large clusters*.
We've spent a lot of time discussing the right approach for this feature; there are some tough tradeoffs to be made, and though we've made our best effort at the right balance, more community input would be very helpful. The feature is marked experimental and off by default for now while we see how it works for the broader community. There is a design doc attached to SPARK-8425 with a more thorough explanation, but I wanted to highlight some key aspects here.

Before these changes, Spark was particularly vulnerable to flaky hardware on one node -- the most common example is one bad disk somewhere in your cluster. After the changes go in, Spark will avoid scheduling tasks on executors and nodes with previous failures. A brief summary of the behavior when blacklisting is turned on:

1) When a task fails, it is never scheduled on the same executor again. If it also fails on another executor on the same node, that task is never scheduled on that node again.

2) If an executor has two failed tasks within one task set, it is blacklisted for the entire task set. The same applies to a node.

3) After a task set completes _successfully_, Spark checks whether it should blacklist any executors or nodes for the entire application. Executors or nodes with more than two failures in *successful* task sets are blacklisted for the application, for a configurable timeout with a default of 1 hour.

There are a number of confs added to control the cutoffs used, but we hope the defaults are sensible for most users.

These changes will land in two patches. SPARK-17675 adds the behavior within a task set; SPARK-8425 will follow soon after with application-level blacklisting. There are other follow-up changes, e.g. adding this info to the UI, but the core functionality is in those two.

Choosing the right behavior is particularly tough because Spark doesn't know why tasks fail -- is it a real hardware issue?
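For reference, turning the feature on and tuning the cutoffs would look roughly like the following spark-defaults.conf fragment. The property names and defaults here reflect my reading of the pending patches and may well change while the feature is experimental, so treat them as a sketch rather than final documentation:

```
# Sketch only -- names/defaults subject to change while experimental.
# Enable blacklisting (off by default for now):
spark.blacklist.enabled                           true
# How long an executor/node stays blacklisted for the application:
spark.blacklist.timeout                           1h
# Per-task-set cutoffs corresponding to rules 1 and 2 above:
spark.blacklist.task.maxTaskAttemptsPerExecutor   1
spark.blacklist.stage.maxFailedTasksPerExecutor   2
```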
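To make the per-task-set rules (1 and 2 above) concrete, here is a small Python sketch of the policy. This is purely illustrative -- Spark's real implementation lives in the Scala scheduler, and the class and method names here are made up for the example; the thresholds of 2 mirror the defaults described above:

```python
from collections import defaultdict

class TaskSetBlacklistSketch:
    """Illustrative sketch (NOT Spark's actual code) of the
    per-task-set blacklisting rules described in this email."""

    def __init__(self, max_failed_tasks_per_exec=2):
        # task -> set of executors it has failed on
        self.task_failures = defaultdict(set)
        # executor -> count of distinct tasks that failed on it
        self.failed_tasks_per_exec = defaultdict(int)
        # executor -> node, recorded as failures are reported
        self.exec_to_node = {}
        self.max_failed_tasks_per_exec = max_failed_tasks_per_exec

    def task_failed(self, task, executor, node):
        self.exec_to_node[executor] = node
        if executor not in self.task_failures[task]:
            self.task_failures[task].add(executor)
            self.failed_tasks_per_exec[executor] += 1

    def is_task_blacklisted(self, task, executor, node):
        # Rule 1: a failed task never runs on the same executor again,
        # and never on the same node once it has failed on two
        # different executors there.
        failed_on = self.task_failures[task]
        if executor in failed_on:
            return True
        failed_on_node = [e for e in failed_on
                          if self.exec_to_node.get(e) == node]
        return len(failed_on_node) >= 2

    def is_exec_blacklisted(self, executor):
        # Rule 2: an executor with two failed tasks is blacklisted
        # for the entire task set.
        return (self.failed_tasks_per_exec[executor]
                >= self.max_failed_tasks_per_exec)

    def is_node_blacklisted(self, node):
        # Rule 2, node version: a node is blacklisted for the task set
        # once two of its executors are blacklisted.
        bad = [e for e, n in self.exec_to_node.items()
               if n == node and self.is_exec_blacklisted(e)]
        return len(bad) >= 2
```

For example, after a task fails on two executors on the same node, the scheduler would stop offering that node to the task at all, even though other tasks can still run there.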
Or is there just temporary resource contention, perhaps due to an ill-behaved application? We needed to balance:

1) Robustness -- making sure Spark survives a bad node under a wide variety of problems.

2) Resource utilization -- avoiding blacklisting resources that are perfectly fine.

3) Ease of use -- the feature needs to be easy to configure, with sensible logging etc. when there are issues.

Hopefully the approach strikes a good balance for the majority of use cases, but we're particularly interested in hearing from users with large clusters, where hardware failures are encountered more frequently.

A number of people have put a lot of effort into driving the discussion and implementation of this issue, in particular: Kay Ousterhout, Tom Graves, Mark Hamstra, Mridul Muralidharan, Wei Mao and Jerry Shao (sorry if I missed anybody!).

thanks,
Imran