[GitHub] [spark] agrawaldevesh commented on a change in pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

GitBox Tue, 21 Jul 2020 11:15:26 -0700


agrawaldevesh commented on a change in pull request #29014:
URL: https://github.com/apache/spark/pull/29014#discussion_r458295401




##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1767,8 +1767,13 @@ private[spark] class DAGScheduler(
 
           // TODO: mark the executor as failed only if there were lots of fetch failures on it
           if (bmAddress != null) {
-            val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
-              unRegisterOutputOnHostOnFetchFailure) {
+            val externalShuffleServiceEnabled = env.blockManager.externalShuffleServiceEnabled
+            val isHostDecommissioned = taskScheduler
+              .getExecutorDecommissionInfo(bmAddress.executorId)
+              .exists(_.isHostDecommissioned)

Review comment:
       I thought a bit more about this, and I think the author of #28911 
should implement the optimization of deciding whether shuffle state should be 
cleared when you get a fetch failure from an executor while other "local" 
executors might still serve those outputs. 
   
   This is an optimization in that it leverages the local fetch feature 
introduced in #28911. I don't know enough about the local fetch 
implementation to comment further.
   
   From my perspective of decommissioning, I have changed the logic to be: 
executors on a host are fate-shared if either we know that they share an 
external shuffle service or we know that the host has been decommissioned. 
In that case, mark the entire host as lost, i.e. unregister the shuffle 
outputs for the whole host rather than for the single executor (if the 
feature flag `unRegisterOutputOnHostOnFetchFailure` is enabled).
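   
   As a rough, standalone sketch of that decision (not the actual 
`DAGScheduler` code; the object and method names below are hypothetical, and 
the three booleans stand in for values the scheduler would look up itself):

```scala
object FateSharingSketch {
  // Hypothetical helper, not part of Spark. Returns the host whose shuffle
  // outputs should all be unregistered after a fetch failure, or None when
  // only the failing executor's outputs should be unregistered.
  def hostToUnregisterOutputs(
      host: String,
      externalShuffleServiceEnabled: Boolean,
      isHostDecommissioned: Boolean,
      unRegisterOutputOnHostOnFetchFailure: Boolean): Option[String] = {
    // Executors on the host are fate-shared if they share an external
    // shuffle service or if the host itself has been decommissioned.
    val executorsOnHostAreFateShared =
      externalShuffleServiceEnabled || isHostDecommissioned

    // The host-wide unregistration is still gated by the feature flag.
    if (executorsOnHostAreFateShared && unRegisterOutputOnHostOnFetchFailure) {
      Some(host)
    } else {
      None
    }
  }
}
```

   With the flag off, or with neither condition holding, this falls back to 
the per-executor behavior.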
   
   cc: @Ngone51 (author of #28911)



