GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/1931

    [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts

    More detail on the issue is described in 
[SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is 
if we send too many blocking Akka messages that are dependent on each other in 
quick successions, then we end up causing a few of these messages to time out 
and ultimately kill the executors. As of #1498, we broadcast each RDD whether 
or not it is persisted. This means if we create many RDDs (each of which 
becomes a broadcast) and the driver performs a GC that cleans up all of these 
broadcast blocks, then we end up sending many `RemoveBroadcast` messages in 
parallel and trigger the chain of blocking messages at high frequencies.
    
    We do not know of the Akka-level root cause yet, so this is intended to be 
a temporary solution until we identify the real issue. I have done some 
preliminary testing of enabling blocking and observed that the queue length 
remains quite low (< 1000) even under very intensive workloads. This PR also 
logs an error message whenever the queue length exceeds a certain capacity, 
though I believe this is an unlikely case. Note that this is not a hard cap 
that limits the number of items in our reference queue but simply a soft 
threshold.
    
    In the long run, we should do something more sophisticated to allow a 
limited degree of parallelism through batching clean up tasks or processing 
them in a sliding window. In the longer run, we should clean up the whole 
`BlockManager*` message passing interface to avoid unnecessarily awaiting on 
futures created from Akka asks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark reference-blocking

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1931.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1931
    
----
commit 0b7e7685c5c1c7dda73c64a9b5ed929ff3484a8d
Author: Andrew Or <[email protected]>
Date:   2014-08-13T22:19:42Z

    Block on cleaning tasks by default + log error on queue full

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to