[ https://issues.apache.org/jira/browse/SPARK-33085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213815#comment-17213815 ]

t oo commented on SPARK-33085:
------------------------------

# Start a long-running Spark job in cluster mode.
# Terminate all Spark workers.
# Check the status of the driver.
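
The status check in step 3 can be automated. Below is a minimal sketch, assuming a Standalone master whose REST submission gateway is reachable (the `spark://myhost:6066` URL and the submission ID are placeholders, and `parse_driver_state`/`should_rerun` are hypothetical helpers, not Spark API). It applies the rerun policy argued for in this report: treat KILLED/ERROR as transient, FAILED as non-transient.

```python
import subprocess

# Assumed policy from this report: KILLED/ERROR indicate a transient failure
# (e.g. a spot worker was terminated); FAILED indicates an application error.
TRANSIENT_STATES = {"KILLED", "ERROR"}
NON_TRANSIENT_STATES = {"FAILED", "FINISHED"}


def parse_driver_state(status_output: str) -> str:
    """Pull the driverState value out of a spark-submit --status response.

    The Standalone REST gateway replies with JSON containing a line such as
    "driverState" : "FAILED".
    """
    for line in status_output.splitlines():
        if "driverState" in line:
            return line.split(":", 1)[1].strip().strip('",')
    raise RuntimeError("driverState not found in status output")


def should_rerun(driver_state: str) -> bool:
    """Apply the rerun policy: True for transient states, False otherwise."""
    state = driver_state.strip().upper()
    if state in TRANSIENT_STATES:
        return True
    if state in NON_TRANSIENT_STATES:
        return False
    raise ValueError(f"unrecognised driver state: {driver_state!r}")


def poll_driver_state(master_rest_url: str, submission_id: str) -> str:
    """Query the master for a driver's state, e.g.
    poll_driver_state("spark://myhost:6066", "driver-20200930160855-0316").
    Both arguments are placeholders for your cluster's values.
    """
    result = subprocess.run(
        ["spark-submit", "--master", master_rest_url,
         "--status", submission_id],
        capture_output=True, text=True, check=True,
    )
    return parse_driver_state(result.stdout)
```

With the bug described here, a spot-instance termination surfaces as FAILED, so a policy like this would wrongly skip the resubmission.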

> "Master removed our application" error leads to FAILED driver status instead 
> of KILLED driver status
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33085
>                 URL: https://issues.apache.org/jira/browse/SPARK-33085
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.4.6
>            Reporter: t oo
>            Priority: Major
>
>  
> driver-20200930160855-0316 exited with status FAILED
>  
> I am using the Spark Standalone scheduler with spot EC2 workers. I confirmed 
> that the myip.87 EC2 instance was terminated at 2020-09-30 16:16.
>  
> *I would expect the overall driver status to be KILLED, but instead it was 
> FAILED.* My goal is to interpret a FAILED status as "don't rerun; a 
> non-transient error occurred" and a KILLED/ERROR status as "rerun; a 
> transient error occurred". But FAILED is being set even in the following 
> case, where the error is transient:
>   
> Below are driver logs
> {code:java}
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
> 2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
> 2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,421 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/7 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,423 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/7 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,424 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/8 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/8 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/9 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,427 [dispatcher-event-loop-14] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/9 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,429 [dispatcher-event-loop-5] ERROR org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Application has been killed. Reason: Master removed our application: FAILED
> 2020-09-30 16:16:40,438 [main] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 564822f2-f2fd-42cd-8d57-b6d5dff145f6.
> org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
>   at com.yotpo.metorikku.output.writers.file.FileOutputWriter.save(FileOutputWriter.scala:134)
>   at com.yotpo.metorikku.output.writers.file.FileOutputWriter.write(FileOutputWriter.scala:65)
>   at com.yotpo.metorikku.metric.Metric.com$yotpo$metorikku$metric$Metric$$writeBatch(Metric.scala:97)
>   at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:136)
>   at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:125)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at com.yotpo.metorikku.metric.Metric.write(Metric.scala:125)
>   at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:44)
>   at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:39)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at com.yotpo.metorikku.metric.MetricSet.run(MetricSet.scala:39)
>   at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:17)
>   at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:15)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at com.yotpo.metorikku.Metorikku$.runMetrics(Metorikku.scala:15)
>   at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:11)
>   at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
>   at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:76)
>   at scala.App$$anonfun$main$1.apply(App.scala:76)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>   at scala.App$class.main(App.scala:76)
>   at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
>   at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
>   at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 2020-09-30 16:16:40,457 [stop-spark-context] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
> 2020-09-30 16:16:40,461 [stop-spark-context] ERROR org.apache.spark.util.Utils - Uncaught exception in thread stop-spark-context
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:283)
>   at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:227)
>   at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:124)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:669)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2044)
>   at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
>   at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1903)
> Caused by: org.apache.spark.SparkException: Could not find AppClient.
>   at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
>   at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63)
>   ... 9 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
