Iulian Dragos created SPARK-13159:
-------------------------------------

             Summary: External shuffle service broken w/ Mesos
                 Key: SPARK-13159
                 URL: https://issues.apache.org/jira/browse/SPARK-13159
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.0.0
            Reporter: Iulian Dragos


Dynamic allocation and the external shuffle service don't work together on Mesos 
for applications that run longer than {{spark.network.timeout}}.

After two minutes (the default value of {{spark.network.timeout}}), I see many 
FileNotFoundExceptions and Spark jobs simply fail.

{code}
16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 10.0.1.208): java.io.FileNotFoundException: /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0 (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
        at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
        at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
...
{code}

h3. Analysis

The Mesos external shuffle service needs a way to know when it is safe to delete 
shuffle files for a given application. The current solution (which worked while 
the RPC transport was based on Akka) was to open a TCP connection between the 
driver and each external shuffle service. Once the driver went down (gracefully 
or by crashing), the shuffle service would eventually get a disconnect 
notification from the network layer and delete the corresponding files.
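The mechanism can be sketched like this (hypothetical names, not Spark's actual classes): the shuffle service holds one long-lived TCP connection per driver and deletes that application's shuffle directory once the connection drops, for any reason.

{code}
import java.net.ServerSocket
import java.nio.file.{Files, Path}
import java.util.Comparator

// Sketch of the disconnect-triggered cleanup idea. The shuffle service
// accepts one connection from the driver and blocks on it; when the driver
// goes away (cleanly or by crashing) the read unblocks and cleanup runs.
object ShuffleCleanupSketch {
  def serveOnce(server: ServerSocket, appDir: Path): Unit = {
    val sock = server.accept()
    try {
      // Block until the peer closes; read() returns -1 on EOF. An
      // idle-timeout in the transport layer would close this socket
      // early -- which is exactly the bug described here.
      while (sock.getInputStream.read() != -1) {}
    } finally {
      sock.close()
      // Driver is gone: remove this application's shuffle files.
      Files.walk(appDir)
        .sorted(Comparator.reverseOrder[Path]())
        .forEach(p => Files.delete(p))
    }
  }
}
{code}

Note that the sketch relies on the connection staying open indefinitely while idle, which is the assumption the Netty transport breaks.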

This solution no longer works because it relies on an idle connection staying 
open, and the new Netty-based RPC layer closes idle connections after 
{{spark.network.timeout}}.
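For reference, a minimal configuration that exercises this code path (dynamic allocation plus the external shuffle service on Mesos; the timeout is shown at its default, and the master URL is a placeholder):

{code}
spark.master                      mesos://<master-url>
spark.dynamicAllocation.enabled   true
spark.shuffle.service.enabled     true
spark.network.timeout             120s
{code}

Any application whose runtime exceeds {{spark.network.timeout}} under this configuration should hit the failure above.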



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
