Adrian Bridgett created SPARK-12583:
---------------------------------------
Summary: spark shuffle fails with mesos after 2mins
Key: SPARK-12583
URL: https://issues.apache.org/jira/browse/SPARK-12583
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.6.0
Reporter: Adrian Bridgett
See user mailing list "Executor deregistered after 2mins" for more details.
As of 1.6, the driver registers with each shuffle manager via
MesosExternalShuffleClient. Once this connection is closed, the shuffle manager
automatically cleans up the data associated with that driver.
However, because the connection is idle, it is terminated while the driver is
still running. Looking at a packet trace, after 120 secs the shuffle manager
sends a FIN packet to the driver. The only way to delay this is to increase
spark.shuffle.io.connectionTimeout (e.g. to 3600s) on the shuffle manager.
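
The effect of the timeout can be illustrated with a toy model. This is only a sketch of the idle-timeout arithmetic, not Spark code: the function names and the 10-minute job are hypothetical, and it assumes the driver sends no traffic on the connection after registering at t=0. Only the 120 s and 3600 s values come from this report.

```python
# Toy model of an idle-connection timeout (not Spark code).
DEFAULT_TIMEOUT_S = 120   # matches the 120 s seen in the packet trace
RAISED_TIMEOUT_S = 3600   # the workaround value from this report


def idle_close_time(last_activity_s, timeout_s):
    """Time at which an idle-timeout check would close the connection."""
    return last_activity_s + timeout_s


def driver_still_registered(job_runtime_s, timeout_s, last_activity_s=0):
    """True if the job ends before the idle connection would be closed."""
    return job_runtime_s < idle_close_time(last_activity_s, timeout_s)


# Hypothetical 10-minute job whose driver is silent after registering at t=0.
print(driver_still_registered(600, DEFAULT_TIMEOUT_S))  # False
print(driver_still_registered(600, RAISED_TIMEOUT_S))   # True
```

Under these assumptions, raising the timeout only moves the point at which the connection is dropped. A job that outlives the new timeout would hit the same cleanup.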
I patched MesosExternalShuffleClient (and ExternalShuffleClient), with
newbie Scala skills, to construct the TransportContext with closeIdleConnections
set to "false", but this didn't help (I hadn't done the network trace first).