[
https://issues.apache.org/jira/browse/SPARK-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcelo Vanzin resolved SPARK-14611.
------------------------------------
Resolution: Not A Problem
Yes, that's what re-attempts are for. If you don't want a re-attempt, then you
can disable it in the configuration. If the re-attempt fails, then it will try
again up to the configured limit.
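For reference, the configuration hinted at above is presumably the Spark-on-YARN attempt limit, spark.yarn.maxAppAttempts (capped by YARN's yarn.resourcemanager.am.max-attempts). A minimal sketch, assuming a Spark 1.6-era cluster; the class name and jar below are placeholders, not from this issue:

  # Sketch only: allow a single application attempt so a failed AM is not retried.
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=1 \
    --class com.example.MyApp \
    myapp.jar

With a single attempt, hitting the executor-failure limit fails the application outright instead of triggering a second AM attempt.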
> Second attempt observed after AM fails due to max number of executor failures
> in first attempt
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-14611
> URL: https://issues.apache.org/jira/browse/SPARK-14611
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Environment: RHEL7 64 bit
> Reporter: Kshitij Badani
>
> I submitted a Spark application in yarn-cluster mode. My cluster has two
> NodeManagers. After submitting the application, I restarted the NodeManager
> on node1, which was actively running a few executors but was not running the AM.
> While the NodeManager was restarting, 3 of the executors running on node2
> failed with 'failed to connect to external shuffle server', as follows:
> java.io.IOException: Failed to connect to node1
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
> at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
> at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
> at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
> at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
> at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: node1
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> Each of the 3 executors tried to connect to the external shuffle service 2
> more times, all while the NM on node1 was restarting, and eventually failed.
> Since 3 executors failed, the AM exited with FAILURE status and I can see the
> following message in the application logs:
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max
> number of executor failures (3) reached)
> After this, we saw a 2nd application attempt, which succeeded because the NM
> had come back up.
> Should we see a 2nd attempt in such scenarios, where multiple executors have
> failed in the 1st attempt because they could not connect to the external
> shuffle service? And if the 2nd attempt also fails for a similar reason,
> wouldn't that be a heavy penalty?
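As a hedged note on the limit that was hit above: the "(3)" in the log line corresponds to spark.yarn.max.executor.failures, documented in the Spark 1.6-era docs as defaulting to roughly twice the number of executors, with a minimum of 3. A sketch of the two relevant properties, with purely illustrative values, in spark-defaults.conf:

  # Tolerate more executor losses before the AM gives up:
  spark.yarn.max.executor.failures   10
  # Or keep a single application attempt so a failed AM is not retried:
  spark.yarn.maxAppAttempts          1

Raising the failure threshold, or disabling re-attempts as suggested in the resolution comment, is how the "heavy penalty" trade-off asked about here is tuned.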