[ 
https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407714#comment-17407714
 ] 

Adam Kennedy commented on SPARK-36446:
--------------------------------------

[~tgraves] Yes, we are running with recovery enabled (in the case where shuffle 
server connections are secure).

But the same problem occurs when the shuffle server is run as an independent 
process outside of YARN (insecurely) if they crash and restart.

> YARN shuffle server restart crashes all dynamic allocation jobs that have 
> deallocated an executor
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36446
>                 URL: https://issues.apache.org/jira/browse/SPARK-36446
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Adam Kennedy
>            Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the 
> shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash), 
> it loses block information and relies on being able to contact Executors (the 
> locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency on the other to hold block information fails fatally 
> under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some 
> amount of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes, or is restarted 
> (either directly when running as a standalone service, or as part of a YARN 
> node manager restart) then there is no way to restore block data and it is 
> permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the 
> shuffle server cannot location the exeutor, decides it doesn't exist, treats 
> it as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real world scenario that we observe on a 1000+ node multi-tenant 
> cluster  where dynamic allocation is on by default, a rolling restart of the 
> YARN node managers will cause ALL jobs that have deallocated any executor and 
> have shuffles or transferred blocks to the shuffle server in order to shut 
> down, to crash.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to