[
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen reassigned SPARK-44542:
------------------------------------
Assignee: YE
> eagerly load SparkExitCode class in SparkUncaughtExceptionHandler
> -----------------------------------------------------------------
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.1.3, 3.3.2, 3.4.1
> Reporter: YE
> Assignee: YE
> Priority: Trivial
> Attachments: image-2023-07-25-16-46-03-989.png,
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk may become corrupted while the
> application is running. The corrupted disk might hold the Spark jars (the
> cached archive from spark.yarn.archive). In that case, the executor JVM can
> no longer load any Spark-related classes.
> 2. Spark leverages the OutputCommitCoordinator to avoid data races between
> speculative tasks, so that no two tasks can commit the same partition at the
> same time. In other words, once a task's commit request is authorized, all
> other commit requests are denied until the authorized task fails.
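The authorization rule described above can be sketched as follows. This is a minimal illustration only, not Spark's actual OutputCommitCoordinator; the class and method names here are invented for the sketch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the commit-coordination rule: the first task
// attempt to claim a partition is authorized to commit; every other
// attempt is denied until that authorization is cleared on failure.
class CommitCoordinator {
    // partition -> attempt currently authorized to commit it
    private final Map<Integer, Integer> authorized = new ConcurrentHashMap<>();

    // Grants permission only to the first claimant (or the same attempt
    // asking again); all other attempts are denied.
    boolean canCommit(int partition, int attempt) {
        Integer prev = authorized.putIfAbsent(partition, attempt);
        return prev == null || prev == attempt;
    }

    // Releases the authorization when the authorized attempt fails,
    // allowing a later (e.g. speculative) attempt to commit.
    void taskFailed(int partition, int attempt) {
        authorized.remove(partition, Integer.valueOf(attempt));
    }
}
```

The hang described in this issue corresponds to the failure notification never arriving: the authorized attempt dies inside a broken exception handler, `taskFailed` is never invoked, and every speculative attempt is denied forever.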
>
> We encountered a corner case that combines the two situations above and
> causes the Spark job to hang. A short timeline is given below:
> # Task 5372 (TID 21662) started running at 21:55.
> # Around 22:00, the disk containing the Spark archive for that
> task's executor became corrupted, making the archive inaccessible from the
> executor JVM's perspective.
> # The task kept running; at 22:05 it requested commit permission from the
> coordinator and began the commit.
> # Due to the corrupted disk, an exception was raised in the executor JVM.
> # The SparkUncaughtExceptionHandler kicked in, but because the jar/disk was
> corrupted, the handler itself threw an exception, and the halt process
> threw an exception too.
> # The executor hung there and ran no more tasks, yet the
> authorized commit request remained valid on the driver side.
> # Speculative tasks started to kick in, but since commit permission was
> never released, all speculative attempts were denied/killed.
> # The job hung until our SRE killed the container from outside.
> Some screenshots are provided below.
> !image-2023-07-25-16-46-03-989.png!
> !image-2023-07-25-16-46-28-158.png!
> !image-2023-07-25-16-46-42-522.png!
> For this specific case, I'd like to propose eagerly loading the SparkExitCode
> class in SparkUncaughtExceptionHandler, so that the halt process can run
> rather than throw an exception because SparkExitCode is not loadable in the
> scenario described above.
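The idea behind the proposal can be sketched as follows. This is a minimal illustration, not Spark's actual code: `ExitCodes` is a hypothetical stand-in for SparkExitCode, and the handler forces that class to be loaded and initialized at construction time, while the jars on disk are still readable:

```java
// Sketch of eager class loading in an uncaught exception handler.
// Hypothetical names; not Spark's actual implementation.
public class EagerLoadDemo {
    static volatile boolean exitCodesInitialized = false;

    // Stand-in for SparkExitCode.
    static class ExitCodes {
        static { exitCodesInitialized = true; } // runs on class initialization
        static final int UNCAUGHT_EXCEPTION = 50;
    }

    static class Handler implements Thread.UncaughtExceptionHandler {
        Handler() {
            // Eagerly load and initialize ExitCodes now, while the handler
            // is being installed. Without this, the first reference could
            // happen inside uncaughtException(), after the disk holding the
            // jar has failed, and the handler itself would die with a
            // NoClassDefFoundError instead of halting the JVM.
            try {
                Class.forName(ExitCodes.class.getName());
            } catch (ClassNotFoundException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            // In real Spark this path would halt the JVM with the exit code.
            System.err.println("halting with exit code "
                + ExitCodes.UNCAUGHT_EXCEPTION);
        }
    }

    public static void main(String[] args) {
        Thread.setDefaultUncaughtExceptionHandler(new Handler());
        System.out.println("ExitCodes initialized eagerly: "
            + exitCodesInitialized);
    }
}
```

Because `Class.forName` both loads and initializes the class, a later disk failure can no longer prevent the handler from reaching the exit code when it needs it.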
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]