Some new features are about to land in Spark to improve its ability to
handle bad executors and nodes.  These are significant changes, and
we'd like to gather more input from the community about them, especially
from folks who run *large clusters*.

We've spent a lot of time discussing the right approach for this feature;
there are some tough tradeoffs that need to be made, and though we made our
best efforts at the right balance, more community input would be very
helpful.  The feature is marked experimental and off by default for now
while we see how it works for the broader community.

There is a design doc attached to SPARK-8425 with a more thorough
explanation, but I wanted to highlight some key aspects.  Before these
changes, Spark was particularly vulnerable to flaky hardware on a single
node.  The most common example is one bad disk somewhere in your cluster.

After the changes go in, Spark will avoid scheduling tasks on executors and
nodes with previous failures.  A brief summary of the behavior when
blacklisting is turned on:

1) When a task fails, it won't ever be scheduled on the same executor
again.  If it also fails on another executor on the same node, that task
won't ever be scheduled on that node again.

2) If an executor has two failed tasks within one task set, it is
blacklisted for the entire task set.  Similarly for a node.

3) After a taskset completes _successfully_, Spark checks whether it
should blacklist any executors or nodes for the entire application.
Executors or nodes with more than 2 failures across *successful* task sets
are blacklisted for the application.  They are blacklisted for a
configurable timeout, with a default of 1 hour.
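To make the task-set-level rules concrete, here is a rough sketch in Python.  This is *not* Spark's actual implementation (which lives in the Scala scheduler), just an illustration of the policy; the class name and the cutoff of 2 are taken from the summary above.

```python
from collections import defaultdict

class TaskSetBlacklist:
    """Hypothetical illustration of blacklisting rules 1 and 2 above."""

    # Rule 2 cutoff: an executor with this many failed tasks in one
    # task set is blacklisted for the entire task set.
    MAX_FAILED_TASKS_PER_EXECUTOR = 2

    def __init__(self):
        # Rule 1: which executors each task has failed on.
        self.execs_failed_for_task = defaultdict(set)
        # Rule 1, node level: per (task, node), which executors on that
        # node the task has failed on.
        self.execs_failed_for_task_on_node = defaultdict(set)
        # Rule 2: distinct tasks that have failed on each executor.
        self.tasks_failed_on_exec = defaultdict(set)

    def record_failure(self, task, executor, node):
        self.execs_failed_for_task[task].add(executor)
        self.execs_failed_for_task_on_node[(task, node)].add(executor)
        self.tasks_failed_on_exec[executor].add(task)

    def is_blacklisted(self, task, executor, node):
        # Rule 1: never reschedule a task on an executor where it failed.
        if executor in self.execs_failed_for_task[task]:
            return True
        # Rule 1, node level: failing on two executors of the same node
        # blacklists that whole node for the task.
        if len(self.execs_failed_for_task_on_node[(task, node)]) >= 2:
            return True
        # Rule 2: an executor with two failed tasks in this task set is
        # blacklisted for the entire task set.
        if (len(self.tasks_failed_on_exec[executor])
                >= self.MAX_FAILED_TASKS_PER_EXECUTOR):
            return True
        return False
```

Note that rule 1 is per-task while rule 2 is per-executor: a single flaky task doesn't take an executor out of the task set, but repeated failures across tasks do.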

There are a number of confs added to control the cutoffs used, but we hope
these defaults are sensible for most users.
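For a rough idea of what turning the feature on might look like: the conf names below are an assumed shape for this example, so check the design doc on SPARK-8425 for the final names and defaults.

```shell
# Illustrative only -- the exact conf names are defined in the patches,
# not here.
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=1h \
  my-app.jar
```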

These changes will land in two patches.  SPARK-17675 adds the behavior
described within a taskset.  SPARK-8425 will follow soon after that with
application-level blacklisting.  There are other follow-up changes, e.g.
adding this info to the UI, but the core functionality is in those two.

Choosing the right behavior is particularly tough because Spark doesn't
know why tasks fail -- is it a real hardware issue?  Or is there just
temporary resource contention, perhaps due to an ill-behaved application?
We needed to balance:

1) Robustness -- making sure Spark would survive a bad node under a wide
variety of problems
2) Resource utilization -- avoiding blacklisting resources when they are
perfectly fine
3) Ease of use -- the feature needs to be easy to configure, and have
sensible logging etc. when there are issues.

Hopefully the approach delivers a good balance for the majority of use
cases, but we're particularly interested in hearing more from use cases
with large clusters where hardware failures are encountered more frequently.

A number of people have put in a lot of effort on driving discussion and
implementation of this issue, in particular: Kay Ousterhout, Tom Graves,
Mark Hamstra, Mridul Muralidharan, Wei Mao and Jerry Shao (sorry if I
missed anybody!).


Reply via email to