Hyukjin Kwon updated SPARK-18691:
---------------------------------
Labels: bulk-closed (was: )
> Spark can hang if a node goes down during a shuffle
> ---------------------------------------------------
>
> Key: SPARK-18691
> URL: https://issues.apache.org/jira/browse/SPARK-18691
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 1.6.1
> Environment: Running on an AWS EMR cluster using yarn
> Reporter: Nicholas Brown
> Priority: Major
> Labels: bulk-closed
> Attachments: retries.txt
>
>
> We have a 200-node cluster that sometimes hangs if a node goes down. Spark
> appears to detect the failure and spin up a replacement node just fine, but
> the other 199 nodes then become unresponsive. From looking at the logs and
> the code, this is what appears to be happening.
> The other nodes each have 36 threads (as far as I can tell, they are
> processing the shuffle) that time out while trying to fetch data from the
> now-dead node. With the default configuration they are supposed to retry 3
> times with 5 seconds between retries, for a maximum delay of 15 seconds.
> However, because spark.shuffle.io.numConnectionsPerPeer is set to its
> default value of 1, the threads can only retry one at a time, and with the
> two-minute default network timeout the delay becomes several hours, which
> is obviously unacceptable (see the arithmetic sketch below). I'll attach a
> portion of our logs showing the retries falling further and further behind.
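> To make the arithmetic concrete, here is a rough sketch; the thread count
> comes from our logs, and the retry and timeout figures are the defaults for
> spark.shuffle.io.maxRetries and spark.network.timeout:
> {code}
> // Worst-case stall when all fetches are serialized through one connection
> // (spark.shuffle.io.numConnectionsPerPeer = 1):
> val threads = 36        // shuffle fetch threads per node (from our logs)
> val retries = 3         // spark.shuffle.io.maxRetries default
> val timeoutSec = 120    // spark.network.timeout default (2 minutes)
> val worstCaseHours = threads * retries * timeoutSec / 3600.0
> // 36 * 3 * 120 / 3600 = 3.6 hours
> {code}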
> I can partially work around this by tuning the configuration options
> mentioned above, particularly by raising numConnectionsPerPeer (a sketch of
> that workaround follows below), but it seems Spark should be able to handle
> this on its own. One thought is to make maxRetries apply across all threads
> that are fetching from the same node. I've seen this on 1.6.1, but from
> reading the code I suspect it exists in the latest version as well.
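> For reference, a minimal sketch of the configuration workaround; the
> specific values (4 connections, a 30-second timeout) are illustrative
> assumptions, not tested recommendations:
> {code}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   // Allow several parallel connections per remote host, so one stalled
>   // fetch does not serialize every other retry behind it (default: 1).
>   .set("spark.shuffle.io.numConnectionsPerPeer", "4")
>   // Fail fetches to a dead node faster than the 120s default.
>   .set("spark.network.timeout", "30s")
> {code}
> Raising numConnectionsPerPeer only reduces the serialization; it does not
> remove the underlying problem of per-thread retries against a dead node.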