Kshitij Badani created SPARK-14611:
--------------------------------------
Summary: Second attempt observed after AM fails due to max number of executor failures in first attempt
Key: SPARK-14611
URL: https://issues.apache.org/jira/browse/SPARK-14611
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.6.1
Environment: RHEL7 64 bit
Reporter: Kshitij Badani
I submitted a Spark application in yarn-cluster mode. My cluster has two
NodeManagers. After submitting the application, I restarted the NodeManager on
node1; this node was actively running a few executors but was not running the AM.
While the NodeManager was restarting, 3 of the executors running on node2 failed
with 'failed to connect to external shuffle server', as follows:
java.io.IOException: Failed to connect to node1
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
    at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: node1
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
Each of the 3 executors retried the connection to the external shuffle service 2
more times, all while the NodeManager on node1 was still restarting, and
eventually failed.
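For context, the stack trace above goes through
BlockManager.registerWithExternalShuffleServer, which retries registration a
fixed number of times. Below is a rough, simplified Scala sketch of that loop;
the names and constants are my assumptions, chosen to match the one initial try
plus 2 retries I observed, and this is not the actual Spark 1.6 source:

import java.io.IOException

// Simplified sketch of the shuffle-service registration retry loop seen in the
// stack trace. maxAttempts and sleepTimeSecs are assumed values matching the
// observed behavior, not copied from the Spark source.
object ShuffleRegistrationSketch {
  def registerWithRetries(register: () => Unit): Unit = {
    val maxAttempts = 3        // initial attempt + 2 retries observed
    val sleepTimeSecs = 5      // assumed back-off between attempts
    for (attempt <- 1 to maxAttempts) {
      try {
        register()             // ExternalShuffleClient.registerWithShuffleServer in the real code
        return
      } catch {
        case _: IOException if attempt < maxAttempts =>
          Thread.sleep(sleepTimeSecs * 1000L)
        // on the final attempt the IOException propagates and the executor fails
      }
    }
  }
}

If the NodeManager restart takes longer than all attempts combined, every
affected executor fails, which is exactly what happened here.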
Since 3 executors failed, the AM exited with a FAILED final status, and I can see
the following message in the application logs:
INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (3) reached)
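For reference, the threshold in that message comes from
spark.yarn.max.executor.failures (as far as I understand, it defaults to twice
the requested executors with a floor of 3, which matches the "(3)" above).
Raising it is one possible mitigation; a minimal sketch, where the value 10 and
the app name are arbitrary examples:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: tolerate more executor failures so a transient NodeManager
// restart does not immediately fail the AM. The value 10 is an arbitrary example.
object TolerantAppSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("nm-restart-tolerant-app")
      .set("spark.yarn.max.executor.failures", "10")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).sum())   // placeholder for real application logic
    } finally {
      sc.stop()
    }
  }
}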
After this, we saw a 2nd application attempt, which succeeded because the
NodeManager had come back up by then.
Should we see a 2nd attempt in scenarios where multiple executors failed in the
1st attempt only because they could not connect to the external shuffle service?
And if the 2nd attempt fails for a similar reason, wouldn't that be a heavy
penalty?
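For what it's worth, the size of that penalty can at least be bounded; a minimal
sketch, assuming spark.yarn.maxAppAttempts and
spark.yarn.am.attemptFailuresValidityInterval (the latter new in 1.6) behave as
documented, with arbitrary example values rather than recommendations:

import org.apache.spark.SparkConf

// Illustrative only: bound the cost of repeated application attempts.
object AttemptLimitSketch {
  val conf: SparkConf = new SparkConf()
    // Cap YARN application attempts (also bounded by yarn.resourcemanager.am.max-attempts).
    .set("spark.yarn.maxAppAttempts", "1")
    // Only count AM failures within this window toward the attempt limit (value shown in ms).
    .set("spark.yarn.am.attemptFailuresValidityInterval", "3600000")
}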