[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

TianyiMa (Jira) Mon, 04 Mar 2024 19:26:13 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


TianyiMa updated SPARK-47279:
-----------------------------
    Description: 
we encounter that spark driver hangs for about 11 hours,  and finall killed by 
user. In the driver log there is an error log: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" 
to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then 
executor 4 sends result to the driver. But in the same time, there is not 
sufficient memory in the the server that running the driver, the driver "unable 
to create new native thread" to handle the successful result of task 0.0, then 
the driver think task 0.0 has not finished and waiting for the "missed result" 
forever.

 

driver submit task:

!driver_submit_task.png!

 

 

  was:
we encounter that spark driver hangs for about 11 hours,  and finall killed by 
user. In the driver log there is an error log: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" 
to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then 
executor 4 sends result to the driver. But in the same time, there is not 
sufficient memory in the the server that running the driver, the driver "unable 
to create new native thread" to handle the successful result of task 0.0, then 
the driver think task 0.0 has not finished and waiting for the "missed result" 
forever.

 

driver submit task:

 

 

 


> spark driver process hangs due to "unable to create new native thread"
> ----------------------------------------------------------------------
>
>                 Key: SPARK-47279
>                 URL: https://issues.apache.org/jira/browse/SPARK-47279
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.1.1, 3.5.0
>            Reporter: TianyiMa
>            Priority: Major
>         Attachments: driver_submit_task.png
>
>
> we encounter that spark driver hangs for about 11 hours,  and finall killed 
> by user. In the driver log there is an error log: 
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
> happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:719)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at 
> org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>         at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
>  
> In detailed analysis, we found that, the driver submit a task 0.0 at 
> "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", 
> then executor 4 sends result to the driver. But in the same time, there is 
> not sufficient memory in the the server that running the driver, the driver 
> "unable to create new native thread" to handle the successful result of task 
> 0.0, then the driver think task 0.0 has not finished and waiting for the 
> "missed result" forever.
>  
> driver submit task:
> !driver_submit_task.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

Reply via email to