Adrian Bridgett created SPARK-12583:
---------------------------------------

             Summary: spark shuffle fails with mesos after 2mins
                 Key: SPARK-12583
                 URL: https://issues.apache.org/jira/browse/SPARK-12583
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.6.0
            Reporter: Adrian Bridgett


See user mailing list "Executor deregistered after 2mins" for more details.

As of 1.6, the driver registers with each shuffle manager via 
MesosExternalShuffleClient.  Once this connection drops, the shuffle manager 
automatically cleans up the data associated with that driver.

However, because the connection is idle, it is terminated while the driver is 
still running.  Looking at a packet trace, after 120 seconds the shuffle manager 
sends a FIN packet to the driver.  The only way to delay this is to raise 
spark.shuffle.io.connectionTimeout (e.g. to 3600s) on the shuffle manager.
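
For reference, a minimal sketch of that workaround as a config fragment. It 
assumes the Mesos shuffle service on each agent picks up its settings from 
conf/spark-defaults.conf (an assumption, not something verified here); the 
3600s value is just the one mentioned above and only postpones the idle close.

```properties
# conf/spark-defaults.conf on each host running the shuffle service
# (assumed location; adjust to however the service is configured).
# Raises the idle timeout after which the service closes the driver connection.
spark.shuffle.io.connectionTimeout  3600s
```

The shuffle service would need a restart to pick this up.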

I patched MesosExternalShuffleClient (and ExternalShuffleClient), with my 
newbie Scala skills, to construct the TransportContext with closeIdleConnections 
set to "false", but this didn't help (I hadn't done the network trace first).



