Thomas Newton created SPARK-57465:
-------------------------------------

             Summary: `RejectedExecutionException` can consume all retries for a task
                 Key: SPARK-57465
                 URL: https://issues.apache.org/jira/browse/SPARK-57465
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.1.2, 4.1.1
         Environment: Ubuntu 24.04, Kubernetes, PySpark 4.1.1
            Reporter: Thomas Newton


It seems that tasks are occasionally submitted to executors that are in the 
process of shutting down. The thread pool on those executors rejects the new 
tasks inside `Executor.launchTask`, throwing a `RejectedExecutionException`. 
Each rejection is counted as a task failure, so it can use up all of the 
task's retries:
{code:java}
: org.apache.spark.SparkException: [STAGE_MATERIALIZATION_MULTIPLE_FAILURES] Multiple failures (2) in stage materialization:
  1. SparkException: Job aborted due to stage failure: Task 3 in stage 96084.0 failed 8 times, most recent failure: Lost task 3.7 in stage 96084.0 (TID 1245120) (10.132.67.92 executor 9): java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@1e77f7d8 rejected from java.util.concurrent.ThreadPoolExecutor@12cd8cbf[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 233337]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
    at org.apache.spark.executor.Executor.launchTask(Executor.scala:433)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:187)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:216)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:76)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:42)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
{code}
A more complete stack trace is attached. 
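
The underlying JDK behaviour can be reproduced outside Spark. The following is 
a minimal sketch, not Spark code: the class and method names 
(`ShutdownRejectionDemo`, `rejectedAfterShutdown`) are made up for 
illustration. It assumes a pool with the default `AbortPolicy`, which is what 
the `ThreadPoolExecutor$AbortPolicy.rejectedExecution` frame in the trace 
suggests. Once `shutdown()` has been called, `execute` throws 
`RejectedExecutionException` even though a worker thread is still running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Illustrative only: this class is hypothetical and is not part of Spark.
class ShutdownRejectionDemo {

    /** Returns true if a task submitted after shutdown() is rejected. */
    static boolean rejectedAfterShutdown() {
        // Fixed pools created by Executors use the default AbortPolicy.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);

        // Keep the single worker busy so the pool cannot terminate yet.
        pool.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stop accepting new tasks; the running task is allowed to finish.
        pool.shutdown();

        try {
            pool.execute(() -> { });
            return false; // not expected: the task was accepted
        } catch (RejectedExecutionException e) {
            return true;  // the situation shown in the stack trace
        } finally {
            release.countDown(); // let the worker finish so the pool terminates
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + rejectedAfterShutdown()); // prints "rejected = true"
    }
}
```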

 

I think this problem was introduced by 
[https://github.com/apache/spark/pull/52792]

The attached patch file seems to fix it. With the patch, if there is a failure 
inside `Executor.launchTask`, the code first checks whether the executor is 
shutting down. If it is, it returns an error status that is not counted 
towards the task retries. 
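
The idea behind the patch can be sketched in plain Java. This is not the 
attached patch and does not use Spark's API: `LaunchGuardSketch`, 
`LaunchOutcome`, `launch` and `countsTowardRetries` are hypothetical names, and 
`ExecutorService.isShutdown()` stands in for whatever shutdown check the 
executor actually performs. It only shows the shape of the logic: on a launch 
failure, check the shutdown state first and classify the failure accordingly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: every name in this class is hypothetical, not Spark's.
class LaunchGuardSketch {

    enum LaunchOutcome { STARTED, REJECTED_DURING_SHUTDOWN, FAILED }

    /** Tries to start the task and classifies any launch failure. */
    static LaunchOutcome launch(ExecutorService pool, Runnable task) {
        try {
            pool.execute(task);
            return LaunchOutcome.STARTED;
        } catch (RuntimeException e) {
            // Check the shutdown state first: a failure caused by the pool
            // going away says nothing about the task itself.
            return pool.isShutdown()
                    ? LaunchOutcome.REJECTED_DURING_SHUTDOWN
                    : LaunchOutcome.FAILED;
        }
    }

    /** Only genuine failures should count against the retry limit. */
    static boolean countsTowardRetries(LaunchOutcome outcome) {
        return outcome == LaunchOutcome.FAILED;
    }

    /** Launches a no-op task on a pool that has already been shut down. */
    static LaunchOutcome launchOnShutDownPool() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.shutdown();
        return launch(pool, () -> { });
    }

    public static void main(String[] args) {
        LaunchOutcome outcome = launchOnShutDownPool();
        // prints "REJECTED_DURING_SHUTDOWN countsTowardRetries=false"
        System.out.println(outcome + " countsTowardRetries=" + countsTowardRetries(outcome));
    }
}
```

In this sketch the scheduler side would be free to resubmit the task to 
another executor without consuming one of its retries.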



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

