GitHub user bbossy opened a pull request:
https://github.com/apache/spark/pull/11272
[SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files
before application has stopped
## Problem description:
The Mesos shuffle service has been completely unusable since Spark 1.6.0. The
problem seems to have appeared with the move from Akka to Netty in the
networking layer. Before 1.6.0, the shuffle service treated an open connection
from the driver as the signal that the driver was still running. Since 1.6.0,
this connection is closed after spark.shuffle.io.connectionTimeout (or
spark.network.timeout if the former is not set) because it is idle. The shuffle
service interprets the closed connection as a sign that the driver has stopped,
even though the driver is still alive. As a result, shuffle files are deleted
before the application has stopped.
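To make the timeout fallback concrete, here is a minimal sketch of how the idle timeout could be resolved. It is not Spark's actual code: the class `ConnectionTimeoutLookup` and the use of a plain map with millisecond values are assumptions made for illustration (Spark parses time strings such as `120s`). The 120 s default matches the documented default of spark.network.timeout and the "after 2mins" symptom in SPARK-12583.
```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Spark. */
class ConnectionTimeoutLookup {
  // Documented default of spark.network.timeout (120s), expressed in milliseconds.
  static final long DEFAULT_NETWORK_TIMEOUT_MS = 120_000L;

  /**
   * Returns the idle timeout for shuffle connections: the shuffle-specific
   * setting if present, otherwise the general network timeout.
   * Assumption: values are already in milliseconds.
   */
  static long connectionTimeoutMs(Map<String, Long> confMs) {
    long networkTimeoutMs =
        confMs.getOrDefault("spark.network.timeout", DEFAULT_NETWORK_TIMEOUT_MS);
    return confMs.getOrDefault("spark.shuffle.io.connectionTimeout", networkTimeoutMs);
  }
}
```
With neither key set, the lookup yields 120000 ms, so an idle driver connection is dropped after two minutes and the shuffle service wrongly concludes that the driver is gone.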
### Context and analysis:
spark shuffle fails with mesos after 2mins:
https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos:
https://issues.apache.org/jira/browse/SPARK-13159
## What changes were proposed in this pull request?
This PR adds a heartbeat signal from the driver (in
MesosExternalShuffleClient) to all registered external Mesos shuffle service
instances. In MesosExternalShuffleBlockHandler, a thread periodically checks
whether a driver has timed out and, if so, cleans up that application's
shuffle files.
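The heartbeat and timeout check described above can be sketched roughly as follows. This is not the code from the patch: `AppHeartbeatTracker`, its methods, and the injected clock and cleanup callback are hypothetical names chosen for illustration. The sketch only shows the bookkeeping: record the last heartbeat per application, and expire applications whose last heartbeat is older than their timeout.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch of heartbeat bookkeeping; not the classes from the patch. */
class AppHeartbeatTracker {
  private static final class AppState {
    final long timeoutMs;
    volatile long lastHeartbeatMs;

    AppState(long timeoutMs, long nowMs) {
      this.timeoutMs = timeoutMs;
      this.lastHeartbeatMs = nowMs;
    }
  }

  private final Map<String, AppState> apps = new ConcurrentHashMap<>();
  private final LongSupplier clockMs;     // injected so the logic can be tested without sleeping
  private final Consumer<String> cleanup; // assumed to delete the application's shuffle files

  AppHeartbeatTracker(LongSupplier clockMs, Consumer<String> cleanup) {
    this.clockMs = clockMs;
    this.cleanup = cleanup;
  }

  /** Called when a driver registers; registration counts as the first heartbeat. */
  void register(String appId, long timeoutMs) {
    apps.put(appId, new AppState(timeoutMs, clockMs.getAsLong()));
  }

  /** Called for each heartbeat message; unknown applications are ignored. */
  void heartbeat(String appId) {
    AppState state = apps.get(appId);
    if (state != null) {
      state.lastHeartbeatMs = clockMs.getAsLong();
    }
  }

  /** Removes and cleans up every application whose heartbeat is overdue. */
  List<String> expireTimedOut() {
    long now = clockMs.getAsLong();
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, AppState> entry : apps.entrySet()) {
      AppState state = entry.getValue();
      boolean overdue = now - state.lastHeartbeatMs > state.timeoutMs;
      // remove(key, value) ensures a concurrent re-registration is not dropped.
      if (overdue && apps.remove(entry.getKey(), state)) {
        cleanup.accept(entry.getKey());
        expired.add(entry.getKey());
      }
    }
    return expired;
  }
}
```
In a running service, `expireTimedOut` would be invoked periodically, for example from a `ScheduledExecutorService`, while the driver side sends heartbeats at an interval comfortably shorter than the timeout.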
## How was this patch tested?
This patch has been tested on a small Mesos test cluster using the
spark-shell. Log output from the Mesos shuffle service:
```
16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
```
Note: there are 2 executors running on this slave.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bbossy/spark SPARK-12583-mesos-shuffle-service-heartbeat
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11272.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11272
----
commit bd30655d92f13cef00fb10afa7c4872c99736d81
Author: Bertrand Bossy <[email protected]>
Date: 2016-02-19T13:59:34Z
SPARK-12583: Heartbeat from MesosExternalShuffleClient to shuffle service
----