[
https://issues.apache.org/jira/browse/SPARK-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcelo Vanzin resolved SPARK-14611.
------------------------------------
Resolution: Not A Problem
Yes, that's what re-attempts are for. If you don't want a re-attempt, then you
can disable it in the configuration. If the re-attempt fails, then it will try
again up to the configured limit.
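For reference, the configuration hinted at above is presumably the Spark-on-YARN attempt limit, spark.yarn.maxAppAttempts (capped by YARN's yarn.resourcemanager.am.max-attempts). A minimal sketch, assuming a Spark 1.6-era cluster; the class name and jar below are placeholders, not from this issue:

  # Sketch only: allow a single application attempt so a failed AM is not retried.
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=1 \
    --class com.example.MyApp \
    myapp.jar

With a single attempt, hitting the executor-failure limit fails the application outright instead of triggering a second AM attempt.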
> Second attempt observed after AM fails due to max number of executor failures
> in first attempt
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-14611
> URL: https://issues.apache.org/jira/browse/SPARK-14611
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Environment: RHEL7 64 bit
> Reporter: Kshitij Badani
>
> I submitted a Spark application in yarn-cluster mode. My cluster has two
> NodeManagers. After submitting the application, I restarted the NodeManager
> on node1, which was actively running a few executors but was not running the AM.
> While the NodeManager was restarting, 3 of the executors running on node2
> failed with 'failed to connect to external shuffle server', as follows:
> java.io.IOException: Failed to connect to node1
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
> at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
> at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
> at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
> at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
> at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: node1
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> Each of the 3 executors tried to connect to the external shuffle service 2
> more times, all while the NM on node1 was restarting, and eventually failed.
> Since 3 executors failed, the AM exited with FAILURE status and I can see the
> following message in the application logs:
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max
> number of executor failures (3) reached)
> After this, we saw a 2nd application attempt, which succeeded because the NM
> had come back up.
> Should we see a 2nd attempt in such scenarios, where multiple executors have
> failed in the 1st attempt because they could not connect to the external
> shuffle service? And if the 2nd attempt also fails for a similar reason,
> wouldn't that be a heavy penalty?
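As a hedged note on the limit that was hit above: the "(3)" in the log line corresponds to spark.yarn.max.executor.failures, documented in the Spark 1.6-era docs as defaulting to roughly twice the number of executors, with a minimum of 3. A sketch of the two relevant properties, with purely illustrative values, in spark-defaults.conf:

  # Tolerate more executor losses before the AM gives up:
  spark.yarn.max.executor.failures   10
  # Or keep a single application attempt so a failed AM is not retried:
  spark.yarn.maxAppAttempts          1

Raising the failure threshold, or disabling re-attempts as suggested in the resolution comment, is how the "heavy penalty" trade-off asked about here is tuned.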