TianyiMa created SPARK-47279:
--------------------------------
Summary: spark driver process hangs due to "unable to create new
native thread"
Key: SPARK-47279
URL: https://issues.apache.org/jira/browse/SPARK-47279
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 3.5.0, 3.1.1
Reporter: TianyiMa
We observed a Spark driver that hung for about 11 hours until it was finally
killed by the user. The driver log contains the following error:
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{quote}
On closer analysis, we found that the driver submitted task 0.0 to executor 4
at "16:40:50", and executor 4 finished task 0.0 at "16:42:39" and sent the
result to the driver. At that moment, however, the server running the driver
did not have sufficient memory, so the driver was "unable to create new native
thread" to handle the successful result of task 0.0. As a result, the driver
considered task 0.0 unfinished and waited forever for the "missed result".
!image-2024-03-05-11-12-00-227.png!
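The failure mode described above can be illustrated with a small standalone
simulation. This is only a sketch and not Spark code: the class name
LostResultDemo, the method resultProcessed, and the failing thread factory are
hypothetical, and the OutOfMemoryError is thrown artificially from the thread
factory rather than from Thread.start(). It shows that when
ThreadPoolExecutor.execute() fails while creating a worker thread, the
submitted work never runs, so anything waiting for its effect keeps waiting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LostResultDemo {

    // Hypothetical stand-in for the OS refusing to create a new thread.
    static final ThreadFactory FAILING_FACTORY = r -> {
        throw new OutOfMemoryError("unable to create new native thread (simulated)");
    };

    /** Returns true if the "task result" handler ran within timeoutMs. */
    static boolean resultProcessed(ThreadFactory factory, long timeoutMs) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(), factory);
        // Counted down only when the result handler actually runs.
        CountDownLatch taskMarkedFinished = new CountDownLatch(1);
        try {
            // Analogous to handing a successful task result to a worker pool.
            pool.execute(taskMarkedFinished::countDown);
        } catch (OutOfMemoryError e) {
            // The error is only logged here; the handler was never scheduled,
            // so nothing will ever mark the task as finished.
            System.err.println("error while processing message: " + e.getMessage());
        }
        try {
            // A bounded wait here; an unbounded wait would hang forever.
            return taskMarkedFinished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy pool: "
                + resultProcessed(Executors.defaultThreadFactory(), 500));
        System.out.println("thread creation fails: "
                + resultProcessed(FAILING_FACTORY, 500));
    }
}
```

In this sketch the wait is bounded by a timeout so the program terminates. The
driver in this report had no such bound on the lost result, which matches the
observed 11-hour hang.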