Thomas Newton created SPARK-57465:
-------------------------------------
Summary: `RejectedExecutionException` can consume all retries for a task
Key: SPARK-57465
URL: https://issues.apache.org/jira/browse/SPARK-57465
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.1.2, 4.1.1
Environment: Ubuntu 24.04
Kubernetes
PySpark 4.1.1
Reporter: Thomas Newton
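It seems that tasks are occasionally submitted to executors that are in the
process of shutting down. The thread pool on those executors rejects the new
tasks inside `Executor.launchTask`, throwing a `RejectedExecutionException`.
Each rejection is counted as a task failure, so a task can use up all of its
retries this way (8 failed attempts in the trace below).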
{code:java}
: org.apache.spark.SparkException: [STAGE_MATERIALIZATION_MULTIPLE_FAILURES] Multiple failures (2) in stage materialization: 1. SparkException: Job aborted due to stage failure: Task 3 in stage 96084.0 failed 8 times, most recent failure: Lost task 3.7 in stage 96084.0 (TID 1245120) (10.132.67.92 executor 9): java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@1e77f7d8 rejected from java.util.concurrent.ThreadPoolExecutor@12cd8cbf[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 233337]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
    at org.apache.spark.executor.Executor.launchTask(Executor.scala:433)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:187)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:216)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:76)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:42)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
{code}
A more complete stack trace is attached.
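The rejection in the trace above is standard `java.util.concurrent.ThreadPoolExecutor` behaviour, so it can be reproduced outside Spark. The following is a minimal standalone sketch, not Spark code: the class name `RejectionDemo` and its method are made up for illustration. It starts one long-running task, calls `shutdown()`, and then tries to submit another task while the pool is still shutting down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; it only mirrors the pool state seen in the trace.
class RejectionDemo {
    /** Returns true if a pool that is shutting down rejects a newly submitted task. */
    static boolean rejectsWhileShuttingDown() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);

        // Keep one worker busy so the pool stays in the "Shutting down" state.
        pool.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // shutdown() lets running tasks finish but refuses any new submissions.
        pool.shutdown();

        boolean rejected = false;
        try {
            pool.execute(() -> { });
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy throws, as in the stack trace.
            rejected = true;
        } finally {
            release.countDown(); // let the busy worker finish so the pool can terminate
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected while shutting down: " + rejectsWhileShuttingDown());
    }
}
```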
I think this problem was introduced by
[https://github.com/apache/spark/pull/52792].
The attached patch file seems to fix it. With the patch, if there is a failure
inside `Executor.launchTask`, the executor first checks whether it is shutting
down. If it is, it reports an error status that is not counted towards the
task's retries.
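To illustrate the idea only, here is a small standalone sketch of that check. It is not the attached patch and does not use Spark's classes: `LaunchSketch`, the `Outcome` enum and the `shuttingDown` flag are all hypothetical names, and Spark's real task states and failure reasons are different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for an executor that launches tasks on a thread pool.
class LaunchSketch {
    /** Hypothetical outcomes; only the last one would be exempt from the retry count. */
    enum Outcome { LAUNCHED, FAILED_COUNTS_AS_RETRY, FAILED_EXECUTOR_SHUTTING_DOWN }

    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    /** Marks the executor as shutting down before the pool starts refusing work. */
    void stop() {
        shuttingDown.set(true);
        pool.shutdown();
    }

    /** Tries to launch a task; on failure, checks the shutdown flag first. */
    Outcome launch(Runnable task) {
        try {
            pool.execute(task);
            return Outcome.LAUNCHED;
        } catch (RuntimeException e) {
            // Covers RejectedExecutionException from a pool that is shutting down.
            return shuttingDown.get()
                ? Outcome.FAILED_EXECUTOR_SHUTTING_DOWN
                : Outcome.FAILED_COUNTS_AS_RETRY;
        }
    }

    public static void main(String[] args) {
        LaunchSketch executor = new LaunchSketch();
        System.out.println(executor.launch(() -> { }));
        executor.stop();
        System.out.println(executor.launch(() -> { }));
    }
}
```

In this sketch the flag is set before `shutdown()` is called, so a rejection caused by the shutdown is always seen with the flag already set.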