GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/1931
[SPARK-3015] Block on cleaning tasks to prevent Akka timeouts
More detail on the issue is described in
[SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is
if we send too many blocking Akka messages that are dependent on each other in
quick successions, then we end up causing a few of these messages to time out
and ultimately kill the executors. As of #1498, we broadcast each RDD whether
or not it is persisted. This means if we create many RDDs (each of which
becomes a broadcast) and the driver performs a GC that cleans up all of these
broadcast blocks, then we end up sending many `RemoveBroadcast` messages in
parallel and trigger the chain of blocking messages at high frequencies.
We do not know of the Akka-level root cause yet, so this is intended to be
a temporary solution until we identify the real issue. I have done some
preliminary testing of enabling blocking and observed that the queue length
remains quite low (< 1000) even under very intensive workloads. This PR also
logs an error message whenever the queue length exceeds a certain capacity,
though I believe this is an unlikely case. Note that this is not a hard cap
that limits the number of items in our reference queue but simply a soft
threshold.
In the long run, we should do something more sophisticated to allow a
limited degree of parallelism through batching clean up tasks or processing
them in a sliding window. In the longer run, we should clean up the whole
`BlockManager*` message passing interface to avoid unnecessarily awaiting on
futures created from Akka asks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark reference-blocking
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1931.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1931
----
commit 0b7e7685c5c1c7dda73c64a9b5ed929ff3484a8d
Author: Andrew Or <[email protected]>
Date: 2014-08-13T22:19:42Z
Block on cleaning tasks by default + log error on queue full
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]