SaurabhChawla100 commented on issue #27636: [SPARK-30873][CORE][YARN] Handling Node Decommissioning for Yarn cluster manager in Spark
URL: https://github.com/apache/spark/pull/27636#issuecomment-593338415

> Thanks for working on this PR. I'm not super familiar with the YARN code path, so we should get some folks with more YARN background to look at this. That being said I'm a little confused with some elements of the design:
>
> 1. Why do we exit the executors before the shuffle service? I know we want to keep the blocks, but leaving the shuffle service probably blocks the maintenance or decom task so it seems not ideal.
> 2. Is there a way we could just track the executors for decom like we have basic infra for?
> 3. The task fetch retry logic I don't think I understand what the expected case is here and how it is going to help.

Thanks for reviewing this PR. Please feel free to add other experts whom you consider valuable reviewers. Below I have tried to answer your queries.

1) **Why do we exit the executors before the shuffle service? I know we want to keep the blocks, but leaving the shuffle service probably blocks the maintenance or decom task so it seems not ideal.**

This applies to Spark running with the External Shuffle Service. There are two reasons why we exit the executors before the shuffle service:

a) Under the current logic, whenever we receive the node decommissioning notification we stop assigning new tasks to the executors running on that node, and we give the tasks already running on those executors some time to complete before killing the executors. If we kept the executors running until the end, they could generate more shuffle data that would eventually be lost, triggering a recompute later. This approach minimizes recomputation of shuffle data and maximizes the usage of the shuffle data on the node by keeping it available until the end.

b) We want to keep the shuffle data until the node is about to be lost, so that any tasks depending on that shuffle data can still complete, and we do not have to recompute it if no task ends up needing it.

If the user is not using the External Shuffle Service, then `spark.graceful.decommission.executor.leasetimePct` and `spark.graceful.decommission.shuffedata.leasetimePct` must be kept at the same value to prevent fetch-failure exceptions.
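To make the two lease windows concrete, here is a minimal sketch of how the two `leasetimePct` settings could split the decommission window into an executor phase and a shuffle-data phase. The helper name, signature, and the percent-of-window interpretation are my illustration based on the config names, not the PR's actual code:

```
// Illustrative sketch only: the method name, signature and the
// percent-of-window interpretation are assumptions, not the PR's code.
object DecommissionTimeline {

  // Splits the window between the decommission notice and the expected
  // node termination into two deadlines:
  //   - executors are killed once executorLeasePct of the window passes
  //     (spark.graceful.decommission.executor.leasetimePct)
  //   - shuffle data is given up once shuffleDataLeasePct of it passes
  //     (spark.graceful.decommission.shuffedata.leasetimePct)
  def leaseDeadlines(
      decommissionStartMs: Long,
      nodeTerminationMs: Long,
      executorLeasePct: Int,
      shuffleDataLeasePct: Int): (Long, Long) = {
    require(executorLeasePct <= shuffleDataLeasePct,
      "executors must exit no later than the shuffle data is given up")
    val window = nodeTerminationMs - decommissionStartMs
    val executorDeadlineMs = decommissionStartMs + window * executorLeasePct / 100
    val shuffleDeadlineMs = decommissionStartMs + window * shuffleDataLeasePct / 100
    (executorDeadlineMs, shuffleDeadlineMs)
  }
}
```

For example, with a 120-second window, an executor lease of 50% and a shuffle-data lease of 90%, executors exit 60 seconds in while the shuffle service keeps serving blocks until 108 seconds in; setting both percentages equal, as recommended above for the non-ESS case, collapses the two phases into one.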
2) **Is there a way we could just track the executors for decom like we have basic infra for?**

The entire logic here is for the decommissioning of nodes, so the decommission tracker holds the information about the decommissioned nodes.

3) **The task fetch retry logic I don't think I understand what the expected case is here and how it is going to help.**

In the existing code, the check to abort a stage is `val shouldAbortStage = failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts || disallowStageRetryForTest`. Some stage failures are fetch-failed exceptions caused by decommissioned nodes, so we need to handle the abort decision gracefully by discounting, to some extent, the shuffle fetch failures due to decommissioned nodes:

```
val shouldAbortStage =
  failedStage.failedAttemptIds.size >=
    (maxConsecutiveStageAttempts + failedStage.ignoredFailedStageAttempts) ||
  disallowStageRetryForTest ||
  failedStage.ignoredFailedStageAttempts > maxIgnoredFailedStageAttempts
```

`ignoredFailedStageAttempts` is incremented only when a fetch failure is due to node decommissioning. This makes the Spark application more reliable against such failures, which cannot be controlled.
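To illustrate the accounting, here is a small self-contained sketch. The `StageAttempts` class and `recordFailure` method are stand-ins invented for this example; only the `shouldAbortStage` condition mirrors the snippet above:

```
// Stand-in sketch: StageAttempts and recordFailure are invented for
// illustration; only the shouldAbortStage condition mirrors the PR.
class StageAttempts(
    maxConsecutiveStageAttempts: Int,
    maxIgnoredFailedStageAttempts: Int) {

  private val failedAttemptIds = scala.collection.mutable.Set[Int]()
  private var ignoredFailedStageAttempts = 0

  // Record a failed attempt; a fetch failure caused by a decommissioned
  // node also grows the "ignored" budget instead of pushing the stage
  // toward abort.
  def recordFailure(attemptId: Int, dueToDecommission: Boolean): Unit = {
    failedAttemptIds += attemptId
    if (dueToDecommission) ignoredFailedStageAttempts += 1
  }

  // Decommission-related failures raise the effective attempt limit,
  // but only up to a hard cap.
  def shouldAbortStage(disallowStageRetryForTest: Boolean = false): Boolean =
    failedAttemptIds.size >=
      (maxConsecutiveStageAttempts + ignoredFailedStageAttempts) ||
    disallowStageRetryForTest ||
    ignoredFailedStageAttempts > maxIgnoredFailedStageAttempts
}
```

With `maxConsecutiveStageAttempts = 4`, a stage that fails five times, two of them from fetch failures on decommissioned nodes, is retried (5 < 4 + 2), whereas the same five failures on healthy nodes would abort it.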
