[ 
https://issues.apache.org/jira/browse/SPARK-57465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089227#comment-18089227
 ] 

Anupam Yadav commented on SPARK-57465:
--------------------------------------

I'm looking into this and working on a fix.

> `RejectedExecutionException` can consume all retries for a task
> ---------------------------------------------------------------
>
>                 Key: SPARK-57465
>                 URL: https://issues.apache.org/jira/browse/SPARK-57465
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.1.1, 4.1.2
>         Environment: Ubuntu 24.04
> Kubernetes
> PySpark 4.1.1
>            Reporter: Thomas Newton
>            Priority: Major
>         Attachments: fix_execution_rejected_handling.patch, traceback.txt
>
>
> It seems that tasks are occasionally submitted to executors that are in the 
> process of shutting down. The thread pool on those executors rejects the 
> new tasks inside `Executor.launchTask`, throwing a 
> `RejectedExecutionException`. 
> {code:java}
> : org.apache.spark.SparkException: [STAGE_MATERIALIZATION_MULTIPLE_FAILURES] Multiple failures (2) in stage materialization:
>   1. SparkException: Job aborted due to stage failure: Task 3 in stage 96084.0 failed 8 times, most recent failure: Lost task 3.7 in stage 96084.0 (TID 1245120) (10.132.67.92 executor 9): java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@1e77f7d8 rejected from java.util.concurrent.ThreadPoolExecutor@12cd8cbf[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 233337]
>         at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
>         at org.apache.spark.executor.Executor.launchTask(Executor.scala:433)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:187)
>         at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:116)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:216)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>         at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:76)
>         at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:42)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> A more complete stack trace is attached. 
>  
> I think this problem was introduced by 
> [https://github.com/apache/spark/pull/52792].
> The attached patch file seems to fix it. With the patch, if there is a failure 
> inside `Executor.launchTask`, the code first checks whether the executor is 
> shutting down. If so, it returns an error status that does not count towards 
> the task retries. 
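
For readers without the attachment, the following is a minimal, standalone sketch of the idea described above. It is not the attached patch and uses no Spark code. The class name `LaunchTaskSketch`, the `LaunchOutcome` enum, and its values are hypothetical names chosen for illustration; only the `java.util.concurrent` calls are real APIs. It assumes that "shutting down" can be approximated by `ExecutorService.isShutdown()`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LaunchTaskSketch {

    /** Hypothetical outcome labels; these are NOT Spark's real task states. */
    enum LaunchOutcome { LAUNCHED, FAILED_COUNTED, REJECTED_NOT_COUNTED }

    /**
     * Tries to hand a task to the pool. If submission fails, checks whether
     * the pool is shutting down before deciding how to report the failure.
     */
    static LaunchOutcome launchTask(ExecutorService pool, Runnable task) {
        try {
            pool.execute(task);
            return LaunchOutcome.LAUNCHED;
        } catch (RuntimeException e) {
            // A pool that has been shut down rejects new work with
            // RejectedExecutionException (default AbortPolicy).
            if (pool.isShutdown()) {
                // Assumption: this kind of failure should not use up a retry.
                return LaunchOutcome.REJECTED_NOT_COUNTED;
            }
            // Any other submission failure is treated as an ordinary failure.
            return LaunchOutcome.FAILED_COUNTED;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Pool is running: the task is accepted.
        System.out.println(launchTask(pool, () -> { }));

        // After shutdown() the pool rejects new tasks.
        pool.shutdown();
        System.out.println(launchTask(pool, () -> { }));

        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

Running `main` prints `LAUNCHED` followed by `REJECTED_NOT_COUNTED`. How the real fix reports the non-counted status back to the driver is defined by the attached patch, not by this sketch.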



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
