[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298333#comment-15298333 ]
Barry Becker commented on SPARK-14234:
--------------------------------------

Will this fix be back-ported to 1.6.x? We are encountering what appears to be this same issue with Spark 1.6.1 and jobserver 0.6.2. Looking into the logs, we narrowed the problem down to the killing of a task, and we can reliably reproduce it by killing two tasks in a row. It appears that the Mesos slave gets blacklisted after repeated failures and never comes back. The first time a task is killed, we see this in the spark-job-server.log file:

{code}
[2016-04-22 10:11:56,919] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af killed
[2016-04-22 10:11:56,921] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 99
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 99 was cancelled
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,933] ERROR rkUncaughtExceptionHandler [] [] - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.Error: java.nio.channels.ClosedByInterruptException
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
{code}

A few minutes later, another task gets killed:

{code}
[2016-04-22 10:16:49,890] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job cf0c58e9-6496-4d5d-8a6f-0072ca742e33 killed
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 101
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 101 was cancelled
[2016-04-22 10:16:49,892] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id cf0c58e9-6496-4d5d-8a6f-0072ca742e33
[2016-04-22 10:16:50,254] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Lost executor 20160216-173849-2066065046-5050-48639-S0 on ra.engr.sgi.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 0.0 in stage 101.0 (TID 738, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 1.0 in stage 101.0 (TID 739, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Removed TaskSet 101.0, whose tasks have all completed, from pool
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Trying to remove executor 20160216-173849-2066065046-5050-48639-S0 from BlockManagerMaster.
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Removing block manager BlockManagerId(20160216-173849-2066065046-5050-48639-S0, ra.engr.sgi.com, 46374)
[2016-04-22 10:16:50,255] INFO storage.BlockManagerMaster [] [akka://JobServer/user/context-supervisor/sql-context] - Removed 20160216-173849-2066065046-5050-48639-S0 successfully in removeExecutor
[2016-04-22 10:16:50,283] INFO oarseMesosSchedulerBackend [] [] - Mesos task 1 is now TASK_FAILED
[2016-04-22 10:16:50,284] INFO oarseMesosSchedulerBackend [] [] - Blacklisting Mesos slave 20160216-173849-2066065046-5050-48639-S0 due to too many failures; is Spark installed on it?
{code}

> Executor crashes for TaskRunner thread interruption
> ---------------------------------------------------
>
>                 Key: SPARK-14234
>                 URL: https://issues.apache.org/jira/browse/SPARK-14234
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>             Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running, due to a task kill
> or any other reason, the interrupted thread will try to update the task
> status as part of the exception handling and fails with the exceptions
> below. This happens for the statusUpdate calls in all of these catch
> blocks; the corresponding exception for each catch case is shown below.
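The mechanism behind the stack traces that follow can be reproduced outside Spark: `Channels.newChannel(...)` wraps a stream in an interruptible NIO channel (the same `Channels$WritableByteChannelImpl` seen in the traces), and any write through such a channel while the calling thread's interrupt flag is set fails with `ClosedByInterruptException`. A minimal standalone sketch (class and method names here are illustrative, not from Spark or the actual patch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.WritableByteChannel;

// Channels.newChannel(...) returns an AbstractInterruptibleChannel subclass,
// so a write attempted while the thread's interrupt flag is still set throws
// ClosedByInterruptException -- the state a TaskRunner is in right after a
// task kill, which is why the statusUpdate RPC serialization blows up.
public class ClosedByInterruptDemo {

    // Sets the interrupt flag, optionally clears it, then attempts a channel
    // write; returns the exception's simple name, or "ok" on success.
    static String writeWithInterruptFlag(boolean clearFlagFirst) throws Exception {
        Thread.currentThread().interrupt();   // simulate the task-kill interrupt
        if (clearFlagFirst) {
            Thread.interrupted();             // clear the flag before touching the channel
        }
        WritableByteChannel ch = Channels.newChannel(new ByteArrayOutputStream());
        try {
            ch.write(ByteBuffer.wrap(new byte[] {42}));
            return "ok";
        } catch (ClosedByInterruptException e) {
            Thread.interrupted();             // flag is still set; clear it so later calls behave
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeWithInterruptFlag(false)); // ClosedByInterruptException
        System.out.println(writeWithInterruptFlag(true));  // ok
    }
}
```

This only demonstrates the channel behavior; the exact mechanics of the 2.0.0 fix in Executor.scala may differ.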
> {code:title=Executor.scala|borderStyle=solid}
>         case _: TaskKilledException | _: InterruptedException if task.killed =>
>           ......
>         case cDE: CommitDeniedException =>
>           ......
>         case t: Throwable =>
>           ......
> {code}
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
> 	at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> 	at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
> 	at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> 	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
> 	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	..................
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:326)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> 16/03/29 08:00:29 INFO DiskBlockManager: Shutdown hook called
> {code}
> {code:xml}
> 16/03/29 17:28:56 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	..................
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:355)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org