[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298333#comment-15298333 ]
Barry Becker commented on SPARK-14234:
--------------------------------------

Will this fix be back-ported to 1.6.x? We are encountering what appears to be this same issue with Spark 1.6.1 and jobserver 0.6.2. Looking into the logs, we narrowed the problem down to the killing of a task, and we can reliably reproduce it by killing two tasks in a row. It appears that the Mesos slave gets blacklisted after repeated failures and never comes back. The first time a task is killed, we see this in the spark-job-server.log file:

{code}
[2016-04-22 10:11:56,919] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af killed
[2016-04-22 10:11:56,921] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 99
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 99 was cancelled
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,933] ERROR rkUncaughtExceptionHandler [] [] - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.Error: java.nio.channels.ClosedByInterruptException
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
{code}

A few minutes later, another task gets killed:

{code}
[2016-04-22 10:16:49,890] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job cf0c58e9-6496-4d5d-8a6f-0072ca742e33 killed
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 101
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 101 was cancelled
[2016-04-22 10:16:49,892] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id cf0c58e9-6496-4d5d-8a6f-0072ca742e33
[2016-04-22 10:16:50,254] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Lost executor 20160216-173849-2066065046-5050-48639-S0 on ra.engr.sgi.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 0.0 in stage 101.0 (TID 738, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 1.0 in stage 101.0 (TID 739, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Removed TaskSet 101.0, whose tasks have all completed, from pool
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Trying to remove executor 20160216-173849-2066065046-5050-48639-S0 from BlockManagerMaster.
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Removing block manager BlockManagerId(20160216-173849-2066065046-5050-48639-S0, ra.engr.sgi.com, 46374)
[2016-04-22 10:16:50,255] INFO storage.BlockManagerMaster [] [akka://JobServer/user/context-supervisor/sql-context] - Removed 20160216-173849-2066065046-5050-48639-S0 successfully in removeExecutor
[2016-04-22 10:16:50,283] INFO oarseMesosSchedulerBackend [] [] - Mesos task 1 is now TASK_FAILED
[2016-04-22 10:16:50,284] INFO oarseMesosSchedulerBackend [] [] - Blacklisting Mesos slave 20160216-173849-2066065046-5050-48639-S0 due to too many failures; is Spark installed on it?
{code}

> Executor crashes for TaskRunner thread interruption
> ---------------------------------------------------
>
>                 Key: SPARK-14234
>                 URL: https://issues.apache.org/jira/browse/SPARK-14234
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>             Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running, due to a task kill
> or any other reason, the interrupted thread will try to update the task
> status as part of the exception handling and fails with the exceptions
> below. This happens for the statusUpdate calls in all of these catch
> blocks; the corresponding exception for each catch case is shown below.
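The mechanism behind the stack traces that follow can be reproduced outside Spark: `Channels.newChannel(...)` wraps a stream in an interruptible NIO channel (the same `Channels$WritableByteChannelImpl` seen in the traces), and any write through such a channel while the calling thread's interrupt flag is set fails with `ClosedByInterruptException`. A minimal standalone sketch (class and method names here are illustrative, not from Spark or the actual patch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.WritableByteChannel;

// Channels.newChannel(...) returns an AbstractInterruptibleChannel subclass,
// so a write attempted while the thread's interrupt flag is still set throws
// ClosedByInterruptException -- the state a TaskRunner is in right after a
// task kill, which is why the statusUpdate RPC serialization blows up.
public class ClosedByInterruptDemo {

    // Sets the interrupt flag, optionally clears it, then attempts a channel
    // write; returns the exception's simple name, or "ok" on success.
    static String writeWithInterruptFlag(boolean clearFlagFirst) throws Exception {
        Thread.currentThread().interrupt();   // simulate the task-kill interrupt
        if (clearFlagFirst) {
            Thread.interrupted();             // clear the flag before touching the channel
        }
        WritableByteChannel ch = Channels.newChannel(new ByteArrayOutputStream());
        try {
            ch.write(ByteBuffer.wrap(new byte[] {42}));
            return "ok";
        } catch (ClosedByInterruptException e) {
            Thread.interrupted();             // flag is still set; clear it so later calls behave
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeWithInterruptFlag(false)); // ClosedByInterruptException
        System.out.println(writeWithInterruptFlag(true));  // ok
    }
}
```

This only demonstrates the channel behavior; the exact mechanics of the 2.0.0 fix in Executor.scala may differ.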
> {code:title=Executor.scala|borderStyle=solid}
>         case _: TaskKilledException | _: InterruptedException if task.killed =>
>           ......
>         case cDE: CommitDeniedException =>
>           ......
>         case t: Throwable =>
>           ......
> {code}
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
> 	at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> 	at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
> 	at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> 	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> 	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
> 	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	..................
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:326)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> 16/03/29 08:00:29 INFO DiskBlockManager: Shutdown hook called
> {code}
> {code:xml}
> 16/03/29 17:28:56 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> 	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> 	..................
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:355)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	... 2 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org