[
https://issues.apache.org/jira/browse/SPARK-57465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Newton updated SPARK-57465:
----------------------------------
Attachment: traceback.txt
> `RejectedExecutionException` can consume all retries for a task
> ---------------------------------------------------------------
>
> Key: SPARK-57465
> URL: https://issues.apache.org/jira/browse/SPARK-57465
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.1.1, 4.1.2
> Environment: Ubuntu 24.04
> Kubernetes
> PySpark 4.1.1
> Reporter: Thomas Newton
> Priority: Major
> Attachments: traceback.txt
>
>
> It seems that tasks are occasionally submitted to executors that are in the
> process of shutting down. The thread pool on those executors rejects the
> new tasks inside `Executor.launchTask`, throwing a
> `RejectedExecutionException`.
> {code:java}
> : org.apache.spark.SparkException: [STAGE_MATERIALIZATION_MULTIPLE_FAILURES] Multiple failures (2) in stage materialization: 1. SparkException: Job aborted due to stage failure: Task 3 in stage 96084.0 failed 8 times, most recent failure: Lost task 3.7 in stage 96084.0 (TID 1245120) (10.132.67.92 executor 9): java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@1e77f7d8 rejected from java.util.concurrent.ThreadPoolExecutor@12cd8cbf[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 233337]
>     at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
>     at org.apache.spark.executor.Executor.launchTask(Executor.scala:433)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:187)
>     at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:116)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:216)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>     at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:76)
>     at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:42)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> A more complete stack trace is attached.
>
> I think this problem was introduced by
> [https://github.com/apache/spark/pull/52792].
> The attached patch file seems to fix it. If a failure occurs inside
> `Executor.launchTask`, the patch first checks whether the executor is
> shutting down. If it is, it returns an error status that does not count
> towards the task's retry limit.
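
To illustrate the failure mode and the kind of guard described above, here is a minimal standalone sketch using only the JDK. It is not the attached patch and does not use Spark's classes: `LaunchGuardSketch`, `LaunchOutcome`, and `launch` are hypothetical names invented for this example, and the plain `ExecutorService` merely stands in for the executor's task thread pool. The assumption is that a rejection seen while the pool is shutting down can be reported separately from a genuine task failure, so a scheduler could choose not to count it against the retry limit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

final class LaunchGuardSketch {

    // Hypothetical outcome names; these are NOT Spark's task states.
    enum LaunchOutcome { LAUNCHED, REJECTED_DURING_SHUTDOWN, FAILED }

    // Try to hand the task to the pool. If the pool rejects it, check whether
    // the pool has been shut down before deciding how to report the failure.
    static LaunchOutcome launch(ExecutorService pool, Runnable task) {
        try {
            pool.execute(task);
            return LaunchOutcome.LAUNCHED;
        } catch (RejectedExecutionException e) {
            // isShutdown() returns true once shutdown() has been called,
            // even while already-running tasks are still finishing.
            return pool.isShutdown()
                    ? LaunchOutcome.REJECTED_DURING_SHUTDOWN
                    : LaunchOutcome.FAILED;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // While the pool is running, the task is accepted.
        System.out.println(launch(pool, () -> { }));

        // After shutdown() the default AbortPolicy rejects new tasks, which is
        // the same rejection path that appears in the stack trace.
        pool.shutdown();
        System.out.println(launch(pool, () -> { }));
    }
}
```

A caller could treat `REJECTED_DURING_SHUTDOWN` like an executor loss rather than a task failure. How that maps onto Spark's actual task status handling is not shown here.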