[
https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394915#comment-17394915
]
Dongjoon Hyun commented on SPARK-36446:
---------------------------------------
cc [~tgraves] and [~mridul]
> YARN shuffle server restart crashes all dynamic allocation jobs that have
> deallocated an executor
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-36446
> URL: https://issues.apache.org/jira/browse/SPARK-36446
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Adam Kennedy
> Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the
> shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash),
> it loses block information and relies on being able to contact Executors (the
> locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency on the other to hold block information fails fatally
> under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some
> amount of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes, or is restarted
> (either directly when running as a standalone service, or as part of a YARN
> node manager restart) then there is no way to restore block data and it is
> permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the
> shuffle server cannot location the exeutor, decides it doesn't exist, treats
> it as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real world scenario that we observe on a 1000+ node multi-tenant
> cluster where dynamic allocation is on by default, a rolling restart of the
> YARN node managers will cause ALL jobs that have deallocated any executor and
> have shuffles or transferred blocks to the shuffle server in order to shut
> down, to crash.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]