[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
TianyiMa updated SPARK-47279:
-----------------------------
    Attachment: executor_4.png

> spark driver process hangs due to "unable to create new native thread"
> ----------------------------------------------------------------------
>
>                 Key: SPARK-47279
>                 URL: https://issues.apache.org/jira/browse/SPARK-47279
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.1.1, 3.5.0
>            Reporter: TianyiMa
>            Priority: Major
>         Attachments: driver_submit_task.png, executor_4.png
>
>
> We found that the Spark driver hung for about 11 hours and was finally
> killed by the user. The driver log contains the following error:
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:719)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>     at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>     at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>     at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>     at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
>
> On closer analysis, we found that the driver submitted task 0.0 to
> executor 4 at "16:40:50". Executor 4 finished task 0.0 at "16:42:39" and
> then sent the result to the driver. At that moment, however, the server
> running the driver did not have sufficient memory, so the driver was
> "unable to create new native thread" to handle the successful result of
> task 0.0. The driver therefore believed that task 0.0 had not finished and
> waited for the "missed result" forever.
>
> driver submit task:
> !driver_submit_task.png!
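The failure mode described in the report can be reproduced in miniature outside Spark. The following is only a sketch under stated assumptions, not Spark's code: `LostResultSketch`, `simulate`, and `FAILING_FACTORY` are hypothetical names, and a `ThreadFactory` that throws `OutOfMemoryError` stands in for a host that can no longer create native threads. It illustrates that when the only code that clears a task's "running" state lives inside the job handed to `ThreadPoolExecutor.execute`, an error thrown by `execute` itself leaves the task marked as running unless the caller reacts to the failed submission.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from Spark.
class LostResultSketch {

    // Assumption: simulates thread exhaustion by failing every attempt to
    // create a worker thread, with the same message as in the report.
    static final ThreadFactory FAILING_FACTORY = runnable -> {
        throw new OutOfMemoryError("unable to create new native thread");
    };

    // Returns the ids of tasks still considered "running" after the result
    // for task 0 has arrived and its handling has been attempted.
    static Set<Long> simulate(boolean handleSubmitFailure) {
        Set<Long> running = ConcurrentHashMap.newKeySet();
        running.add(0L); // task 0 was submitted to an executor

        ThreadPoolExecutor resultGetter = new ThreadPoolExecutor(
                4, 4, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), FAILING_FACTORY);
        try {
            // The bookkeeping update lives only inside the submitted job,
            // so it never runs if execute() fails to start a worker.
            resultGetter.execute(() -> running.remove(0L));
        } catch (OutOfMemoryError e) {
            if (handleSubmitFailure) {
                // One possible reaction (an assumption, not Spark's fix):
                // stop treating the task as running so it can be retried.
                running.remove(0L);
            }
            // Otherwise the error is swallowed here and task 0 stays
            // "running" with no result ever processed.
        } finally {
            resultGetter.shutdownNow();
        }
        return running;
    }

    public static void main(String[] args) {
        System.out.println("unhandled: still running = " + simulate(false));
        System.out.println("handled:   still running = " + simulate(true));
    }
}
```

With `simulate(false)` task 0 remains in the running set, which mirrors the driver waiting indefinitely; with `simulate(true)` the failed submission is acted on and the set is empty. How Spark itself should react (fail the task, retry, or abort the driver) is a design question for the ticket and is not settled by this sketch.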