Kshitij Badani created SPARK-14611:
--------------------------------------
Summary: Second attempt observed after AM fails due to max number of executor failures in first attempt
Key: SPARK-14611
URL: https://issues.apache.org/jira/browse/SPARK-14611
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.6.1
Environment: RHEL7 64 bit
Reporter: Kshitij Badani
I submitted a Spark application in yarn-cluster mode. My cluster has two
NodeManagers. After submitting the application, I restarted the NodeManager on
node1; this node was actively running a few executors but was not running the AM.
While the NodeManager was restarting, 3 of the executors running on node2 failed
with 'failed to connect to external shuffle server', as follows:
java.io.IOException: Failed to connect to node1
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
    at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: node1
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
Each of the 3 executors retried the connection to the external shuffle service 2
more times, all while the NodeManager on node1 was still restarting, and
eventually failed.
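For context, the stack trace above goes through
BlockManager.registerWithExternalShuffleServer, which retries registration a
fixed number of times. Below is a rough, simplified Scala sketch of that loop;
the names and constants are my assumptions, chosen to match the one initial try
plus 2 retries I observed, and this is not the actual Spark 1.6 source:

import java.io.IOException

// Simplified sketch of the shuffle-service registration retry loop seen in the
// stack trace. maxAttempts and sleepTimeSecs are assumed values matching the
// observed behavior, not copied from the Spark source.
object ShuffleRegistrationSketch {
  def registerWithRetries(register: () => Unit): Unit = {
    val maxAttempts = 3        // initial attempt + 2 retries observed
    val sleepTimeSecs = 5      // assumed back-off between attempts
    for (attempt <- 1 to maxAttempts) {
      try {
        register()             // ExternalShuffleClient.registerWithShuffleServer in the real code
        return
      } catch {
        case _: IOException if attempt < maxAttempts =>
          Thread.sleep(sleepTimeSecs * 1000L)
        // on the final attempt the IOException propagates and the executor fails
      }
    }
  }
}

If the NodeManager restart takes longer than all attempts combined, every
affected executor fails, which is exactly what happened here.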
Since 3 executors failed, the AM exited with a FAILED final status, and I can see
the following message in the application logs:
INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (3) reached)
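For reference, the threshold in that message comes from
spark.yarn.max.executor.failures (as far as I understand, it defaults to twice
the requested executors with a floor of 3, which matches the "(3)" above).
Raising it is one possible mitigation; a minimal sketch, where the value 10 and
the app name are arbitrary examples:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: tolerate more executor failures so a transient NodeManager
// restart does not immediately fail the AM. The value 10 is an arbitrary example.
object TolerantAppSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("nm-restart-tolerant-app")
      .set("spark.yarn.max.executor.failures", "10")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).sum())   // placeholder for real application logic
    } finally {
      sc.stop()
    }
  }
}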
After this, we saw a 2nd application attempt, which succeeded because the
NodeManager had come back up by then.
Should we see a 2nd attempt in scenarios where multiple executors failed in the
1st attempt only because they could not connect to the external shuffle service?
And if the 2nd attempt fails for a similar reason, wouldn't that be a heavy
penalty?
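For what it's worth, the size of that penalty can at least be bounded; a minimal
sketch, assuming spark.yarn.maxAppAttempts and
spark.yarn.am.attemptFailuresValidityInterval (the latter new in 1.6) behave as
documented, with arbitrary example values rather than recommendations:

import org.apache.spark.SparkConf

// Illustrative only: bound the cost of repeated application attempts.
object AttemptLimitSketch {
  val conf: SparkConf = new SparkConf()
    // Cap YARN application attempts (also bounded by yarn.resourcemanager.am.max-attempts).
    .set("spark.yarn.maxAppAttempts", "1")
    // Only count AM failures within this window toward the attempt limit (value shown in ms).
    .set("spark.yarn.am.attemptFailuresValidityInterval", "3600000")
}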