[ https://issues.apache.org/jira/browse/SPARK-36540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-36540. ----------------------------------- Fix Version/s: 3.3.0 Assignee: angerszhu Resolution: Fixed > AM should not just finish with Success when dissconnected > --------------------------------------------------------- > > Key: SPARK-36540 > URL: https://issues.apache.org/jira/browse/SPARK-36540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN > Affects Versions: 3.2.0 > Reporter: angerszhu > Assignee: angerszhu > Priority: Major > Fix For: 3.3.0 > > > We meet a case AM lose connection > {code} > 21/08/18 02:14:15 ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=5675952834716124039, > body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to > xx.xx.xx.xx:41420; closing connection > java.nio.channels.ClosedChannelException > at > io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) > at > io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748) > {code} > Check the code about client, when AMEndpoint dissconnected, will finish > Application with SUCCESS final status > {code} > override def onDisconnected(remoteAddress: RpcAddress): Unit = { > // In cluster mode or unmanaged am case, do not rely on the > disassociated event to exit > // This avoids potentially reporting incorrect exit codes if the driver > fails > if (!(isClusterMode || sparkConf.get(YARN_UNMANAGED_AM))) { > logInfo(s"Driver terminated or disconnected! Shutting down. > $remoteAddress") > finish(FinalApplicationStatus.SUCCEEDED, > ApplicationMaster.EXIT_SUCCESS) > } > } > {code} > Nomally in client mode, when application success, driver will stop and AM > loss connection, it's ok that exit with SUCCESS, but if there is a not work > problem cause dissconnected. Still finish with final status is not correct. > Then YarnClientSchedulerBackend will receive application report with final > status with success and stop SparkContext cause application failed but mark > it as a normal stop. > {code} > private class MonitorThread extends Thread { > private var allowInterrupt = true > override def run() { > try { > val YarnAppReport(_, state, diags) = > client.monitorApplication(appId.get, logApplicationReport = false) > logError(s"YARN application has exited unexpectedly with state > $state! " + > "Check the YARN application logs for more details.") > diags.foreach { err => > logError(s"Diagnostics message: $err") > } > allowInterrupt = false > sc.stop() > } catch { > case e: InterruptedException => logInfo("Interrupting monitor thread") > } > } > def stopMonitor(): Unit = { > if (allowInterrupt) { > this.interrupt() > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org