[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sandeep Katta updated SPARK-21564:
----------------------------------
    Attachment: image-2021-03-03-13-02-06-669.png

> TaskDescription decoding failure should fail the task
> -----------------------------------------------------
>
>                 Key: SPARK-21564
>                 URL: https://issues.apache.org/jira/browse/SPARK-21564
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: image-2021-03-03-13-02-06-669.png, image-2021-03-03-13-02-31-744.png
>
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
>       at java.io.DataInputStream.readFully(DataInputStream.java:197)
>       at java.io.DataInputStream.readUTF(DataInputStream.java:609)
>       at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>       at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
>       at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
>       at scala.collection.immutable.Range.foreach(Range.scala:160)
>       at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
>       at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
>       at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>       at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>       at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>       at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563
> We've since changed the application and have a proposed fix in Spark at the
> ticket above, but it was troubling that decoding the TaskDescription wasn't
> failing the tasks. So the Spark job ended up hanging and making no progress,
> permanently stuck, because the driver thinks the task is running but the
> thread has died in the executor.
> We should make a change around
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
> so that when that decode throws an exception, the task is marked as failed.
> cc [~kayousterhout] [~irashid]
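The change proposed in the issue (mark the task as failed when decoding throws) can be illustrated with a small standalone sketch. This is not Spark's code and not the eventual fix: it is plain Java with no Spark dependency, and `DecodeFailureSketch`, `StatusReporter`, `encode`, `decode` and `handleLaunch` are hypothetical names invented for illustration. It also assumes the handler knows the task id independently of the payload, which may not hold for the real launch message. The only behaviour taken from the report is that `DataInputStream.readUTF` throws `EOFException` on a truncated buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DecodeFailureSketch {

    /** Hypothetical stand-in for whatever channel reports task state to the driver. */
    public interface StatusReporter {
        void taskFailed(long taskId, String reason);
    }

    /** Writes a task id followed by a count-prefixed list of strings (illustrative layout). */
    public static byte[] encode(long taskId, List<String> names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(taskId);
            out.writeInt(names.size());
            for (String name : names) {
                out.writeUTF(name);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the layout written by encode; throws EOFException if the payload is truncated. */
    public static List<String> decode(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        in.readLong();                 // task id
        int count = in.readInt();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(in.readUTF());   // same call that fails in the reported stack trace
        }
        return names;
    }

    /**
     * Decodes the payload and, instead of letting the exception escape to a caller
     * that would only log it, reports the task as failed. Returns true on success.
     */
    public static boolean handleLaunch(long taskId, byte[] payload, StatusReporter reporter) {
        try {
            decode(payload);
            return true;               // a real handler would start the task here
        } catch (IOException e) {
            reporter.taskFailed(taskId, "task description could not be decoded: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = encode(7L, Arrays.asList("a.jar", "b.jar"));
        byte[] truncated = Arrays.copyOf(good, good.length - 3);
        StatusReporter printer = (id, reason) ->
                System.out.println("task " + id + " FAILED: " + reason);
        System.out.println("intact payload launched: " + handleLaunch(7L, good, printer));
        System.out.println("truncated payload launched: " + handleLaunch(7L, truncated, printer));
    }
}
```

The point of the sketch is only the shape of the handler: the decode call sits inside a `try`, and the failure path tells the scheduler side something explicit rather than leaving the task in a running state from the driver's point of view.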