GitHub user corruptmemory opened a pull request:

    https://github.com/apache/spark/pull/13279

    [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file…

    ## What changes were proposed in this pull request?
    
    This is a backport of https://github.com/apache/spark/pull/11272 to the 
1.6.x version line.
    
    ## How was this patch tested?
    
    This PR was tested the same way as the original PR: manual testing with a
    local Mesos cluster.
    
    …s before application has stopped
    
    The Mesos shuffle service has been completely unusable since Spark 1.6.0.
    The problem appears to stem from the move from Akka to Netty in the
    networking layer. Previously, the shuffle service used the connection from
    the driver to each shuffle service as a signal of whether the driver was
    still running. Since 1.6.0, this connection is closed once it has been idle
    for spark.shuffle.io.connectionTimeout (or spark.network.timeout if the
    former is not set). The shuffle service interprets the closed connection as
    a sign that the driver has stopped, even though the driver is still alive.
    As a result, shuffle files are deleted before the application has stopped.
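
To illustrate the timeout fallback described above, here is a minimal sketch of how the effective idle timeout could be resolved. It is not Spark's own code: the function names and the simplified time-string parser are assumptions made for illustration, and the 120s default is the documented default of spark.network.timeout.

```python
import re

# Multipliers to milliseconds for the few suffixes this sketch understands.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_time_ms(value):
    """Parse strings such as '120s' or '2m' into milliseconds.

    Simplified stand-in for Spark's time-string parsing (assumption);
    a bare number is treated as seconds.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MS[unit or "s"]


def effective_idle_timeout_ms(conf):
    """Return the idle timeout after which the connection is closed.

    Prefers spark.shuffle.io.connectionTimeout and falls back to
    spark.network.timeout, as described in the text above.
    """
    raw = conf.get("spark.shuffle.io.connectionTimeout")
    if raw is None:
        raw = conf.get("spark.network.timeout", "120s")
    return parse_time_ms(raw)
```

With neither key set, the sketch yields 120000 ms, which lines up with the "spark shuffle fails with mesos after 2mins" symptom in the JIRA title below.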
    
    spark shuffle fails with mesos after 2mins: 
https://issues.apache.org/jira/browse/SPARK-12583
    External shuffle service broken w/ Mesos: 
https://issues.apache.org/jira/browse/SPARK-13159
    
    This is a follow-up to #11207.
    
    This PR adds a heartbeat signal from the driver (in
    MesosExternalShuffleClient) to all registered external Mesos shuffle service
    instances. In MesosExternalShuffleBlockHandler, a thread periodically checks
    whether a driver has timed out and, if so, cleans up that application's
    shuffle files.
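
The heartbeat-and-cleanup idea can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code in the PR: the class name HeartbeatRegistry, its methods, and the injectable clock are all hypothetical. The service side records the last heartbeat per application, and a periodic check removes applications whose driver has been silent for longer than its timeout.

```python
import threading
import time


class HeartbeatRegistry:
    """Tracks the last heartbeat per application (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so tests can fake time
        self._lock = threading.Lock()    # heartbeats and checks may race
        self._apps = {}                  # app_id -> (last_seen, timeout_s)

    def register(self, app_id, timeout_s):
        """Called when a driver registers with the shuffle service."""
        with self._lock:
            self._apps[app_id] = (self._clock(), timeout_s)

    def heartbeat(self, app_id):
        """Called for each heartbeat; unknown applications are ignored."""
        with self._lock:
            if app_id in self._apps:
                _, timeout_s = self._apps[app_id]
                self._apps[app_id] = (self._clock(), timeout_s)

    def expire(self, cleanup):
        """Remove timed-out applications and call cleanup(app_id) for each.

        Intended to be invoked periodically, e.g. from a timer thread.
        """
        now = self._clock()
        with self._lock:
            expired = [app_id for app_id, (last_seen, timeout_s)
                       in self._apps.items() if now - last_seen > timeout_s]
            for app_id in expired:
                del self._apps[app_id]
        # Run cleanup outside the lock so slow deletes do not block heartbeats.
        for app_id in expired:
            cleanup(app_id)
        return expired
```

The key design point is that liveness no longer depends on a connection staying open: an idle connection being closed has no effect, and only a missing heartbeat for longer than the timeout triggers cleanup.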
    
    This patch has been tested on a small Mesos test cluster using the
    spark-shell. Log output from the Mesos shuffle service:
    ```
    16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
    16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
    ```
    Note: there are 2 executors running on this slave.
    
    Author: Bertrand Bossy <[email protected]>
    
    Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
    
    Initial backport of https://github.com/apache/spark/pull/11272
    
    * No new test failures introduced.
    * Provisional backport complete

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/typesafehub/spark SPARK-12583

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13279
    