SaurabhChawla100 commented on issue #27636: [SPARK-30873][CORE][YARN] Handling Node Decommissioning for Yarn cluster manager in Spark
URL: https://github.com/apache/spark/pull/27636#issuecomment-593338415

> Thanks for working on this PR. I'm not super familiar with the YARN code path, so we should get some folks with more YARN background to look at this. That being said I'm a little confused with some elements of the design:
>
> 1. Why do we exit the executors before the shuffle service? I know we want to keep the blocks, but leaving the shuffle service probably blocks the maintenance or decom task so it seems not ideal.
> 2. Is there a way we could just track the executors for decom like we have basic infra for?
> 3. The task fetch retry logic I don't think I understand what the expected case is here and how it is going to help.

Thanks for reviewing this PR. Please feel free to add other experts whom you consider valuable reviewers. Below I have tried to answer your queries.

1) **Why do we exit the executors before the shuffle service? I know we want to keep the blocks, but leaving the shuffle service probably blocks the maintenance or decom task so it seems not ideal.**

This applies to Spark running with the External Shuffle Service. There are two reasons why we exit the executors before the shuffle service:

a) Under the current logic, whenever we receive the node decommissioning notification we stop assigning new tasks to the executors running on that node, and we give the tasks already running on those executors some time to complete before killing the executors. If we kept the executors running until the end, they could generate more shuffle data that would eventually be lost, triggering a recompute later. This approach minimizes recomputation of shuffle data and maximizes the usage of the shuffle data on the node by keeping it available until the end.

b) We want to keep the shuffle data until the node is about to be lost, so that any tasks depending on that shuffle data can still complete, and we do not have to recompute it if no task ends up needing it.

If the user is not using the External Shuffle Service, then `spark.graceful.decommission.executor.leasetimePct` and `spark.graceful.decommission.shuffedata.leasetimePct` must be kept at the same value to prevent fetch-failure exceptions.
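To make the two lease windows concrete, here is a minimal sketch of how the two `leasetimePct` settings could split the decommission window into an executor phase and a shuffle-data phase. The helper name, signature, and the percent-of-window interpretation are my illustration based on the config names, not the PR's actual code:

```
// Illustrative sketch only: the method name, signature and the
// percent-of-window interpretation are assumptions, not the PR's code.
object DecommissionTimeline {

  // Splits the window between the decommission notice and the expected
  // node termination into two deadlines:
  //   - executors are killed once executorLeasePct of the window passes
  //     (spark.graceful.decommission.executor.leasetimePct)
  //   - shuffle data is given up once shuffleDataLeasePct of it passes
  //     (spark.graceful.decommission.shuffedata.leasetimePct)
  def leaseDeadlines(
      decommissionStartMs: Long,
      nodeTerminationMs: Long,
      executorLeasePct: Int,
      shuffleDataLeasePct: Int): (Long, Long) = {
    require(executorLeasePct <= shuffleDataLeasePct,
      "executors must exit no later than the shuffle data is given up")
    val window = nodeTerminationMs - decommissionStartMs
    val executorDeadlineMs = decommissionStartMs + window * executorLeasePct / 100
    val shuffleDeadlineMs = decommissionStartMs + window * shuffleDataLeasePct / 100
    (executorDeadlineMs, shuffleDeadlineMs)
  }
}
```

For example, with a 120-second window, an executor lease of 50% and a shuffle-data lease of 90%, executors exit 60 seconds in while the shuffle service keeps serving blocks until 108 seconds in; setting both percentages equal, as recommended above for the non-ESS case, collapses the two phases into one.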
2) **Is there a way we could just track the executors for decom like we have basic infra for?**

The entire logic here is for the decommissioning of nodes, so the decommission tracker holds the information about the decommissioned nodes.

3) **The task fetch retry logic I don't think I understand what the expected case is here and how it is going to help.**

In the existing code, the check to abort a stage is `val shouldAbortStage = failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts || disallowStageRetryForTest`. Some stage failures are fetch-failed exceptions caused by decommissioned nodes, so we need to handle the abort decision gracefully by discounting, to some extent, the shuffle fetch failures due to decommissioned nodes:

```
val shouldAbortStage =
  failedStage.failedAttemptIds.size >=
    (maxConsecutiveStageAttempts + failedStage.ignoredFailedStageAttempts) ||
  disallowStageRetryForTest ||
  failedStage.ignoredFailedStageAttempts > maxIgnoredFailedStageAttempts
```

`ignoredFailedStageAttempts` is incremented only when a fetch failure is due to node decommissioning. This makes the Spark application more reliable against such failures, which cannot be controlled.
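To illustrate the accounting, here is a small self-contained sketch. The `StageAttempts` class and `recordFailure` method are stand-ins invented for this example; only the `shouldAbortStage` condition mirrors the snippet above:

```
// Stand-in sketch: StageAttempts and recordFailure are invented for
// illustration; only the shouldAbortStage condition mirrors the PR.
class StageAttempts(
    maxConsecutiveStageAttempts: Int,
    maxIgnoredFailedStageAttempts: Int) {

  private val failedAttemptIds = scala.collection.mutable.Set[Int]()
  private var ignoredFailedStageAttempts = 0

  // Record a failed attempt; a fetch failure caused by a decommissioned
  // node also grows the "ignored" budget instead of pushing the stage
  // toward abort.
  def recordFailure(attemptId: Int, dueToDecommission: Boolean): Unit = {
    failedAttemptIds += attemptId
    if (dueToDecommission) ignoredFailedStageAttempts += 1
  }

  // Decommission-related failures raise the effective attempt limit,
  // but only up to a hard cap.
  def shouldAbortStage(disallowStageRetryForTest: Boolean = false): Boolean =
    failedAttemptIds.size >=
      (maxConsecutiveStageAttempts + ignoredFailedStageAttempts) ||
    disallowStageRetryForTest ||
    ignoredFailedStageAttempts > maxIgnoredFailedStageAttempts
}
```

With `maxConsecutiveStageAttempts = 4`, a stage that fails five times, two of them from fetch failures on decommissioned nodes, is retried (5 < 4 + 2), whereas the same five failures on healthy nodes would abort it.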
