[ https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394905#comment-17394905 ]
Adam Kennedy commented on SPARK-36446:
--------------------------------------

The problem was particularly amplified by the Executor deallocation improvements in SPARK-27677 that came about as a result of SPARK-25888.

> YARN shuffle server restart crashes all dynamic allocation jobs that have
> deallocated an executor
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36446
>                 URL: https://issues.apache.org/jira/browse/SPARK-36446
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Adam Kennedy
>            Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the
> shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash),
> it loses block information and relies on being able to contact Executors (the
> locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency, each side relying on the other to hold block
> information, fails fatally under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some
> number of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes or is restarted
> (either directly when running as a standalone service, or as part of a YARN
> node manager restart), then there is no way to restore block data and it is
> permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the
> shuffle server cannot locate the executor, decides it doesn't exist, treats
> this as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real-world scenario that we observe on a 1000+ node multi-tenant
> cluster where dynamic allocation is on by default, a rolling restart of the
> YARN node managers will crash ALL jobs that have deallocated any executor and
> that either have shuffles or transferred blocks to the shuffle server in
> order to shut down.
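
The failure sequence described in the report can be illustrated with a small toy model. This is only a sketch of the behaviour as the reporter describes it, not Spark or YARN code: `ToyShuffleServer`, `ExecutorGone`, `run_scenario`, and the executor and block names are all hypothetical and exist purely for illustration.

```python
class ExecutorGone(Exception):
    """Hypothetical error: the server knows the executor but cannot reach it."""


class ToyShuffleServer:
    """Toy stand-in for a shuffle server, following the report's description."""

    def __init__(self):
        self.known_executors = set()  # assumed to be stored durably
        self.blocks = {}              # executor id -> block ids, lost on restart

    def register(self, executor_id, block_ids):
        self.known_executors.add(executor_id)
        self.blocks[executor_id] = set(block_ids)

    def restart(self, live_executors):
        # Block information is lost; only the executor locations survive.
        self.blocks = {}
        for executor_id in self.known_executors:
            if executor_id in live_executors:
                # A live executor can re-supply its list of blocks.
                self.blocks[executor_id] = set(live_executors[executor_id])
            # A deallocated executor cannot be contacted, so its blocks stay lost.

    def fetch(self, executor_id, block_id):
        if executor_id not in self.blocks:
            raise ExecutorGone(executor_id)
        if block_id not in self.blocks[executor_id]:
            raise KeyError(block_id)
        return block_id


def run_scenario():
    server = ToyShuffleServer()
    live = {"exec-1": ["shuffle_0_0"], "exec-2": ["shuffle_0_1"]}
    for executor_id, block_ids in live.items():
        server.register(executor_id, block_ids)

    del live["exec-2"]  # dynamic allocation deallocates exec-2

    # Before the restart, the server still serves the departed executor's block.
    before = server.fetch("exec-2", "shuffle_0_1")

    server.restart(live)  # e.g. a node manager restart

    after_live = server.fetch("exec-1", "shuffle_0_0")
    try:
        server.fetch("exec-2", "shuffle_0_1")
        failed = False
    except ExecutorGone:
        failed = True  # in the report, this is what terminates the application
    return before, after_live, failed
```

In this model, blocks owned by a still-running executor are recovered after the restart, while blocks owned by a deallocated executor are not, which matches the report's claim that only jobs that have already deallocated an executor are affected.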