agrawaldevesh commented on a change in pull request #29014:
URL: https://github.com/apache/spark/pull/29014#discussion_r457791246



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1767,10 +1767,18 @@ private[spark] class DAGScheduler(
 
           // TODO: mark the executor as failed only if there were lots of 
fetch failures on it
           if (bmAddress != null) {
-            val hostToUnregisterOutputs = if 
(env.blockManager.externalShuffleServiceEnabled &&
-              unRegisterOutputOnHostOnFetchFailure) {
-              // We had a fetch failure with the external shuffle service, so 
we
-              // assume all shuffle data on the node is bad.
+            val externalShuffleServiceEnabled = 
env.blockManager.externalShuffleServiceEnabled
+            val isHostDecommissioned = taskScheduler
+              .getExecutorDecommissionInfo(bmAddress.executorId)
+              .exists(_.isHostDecommissioned)
+            // Host shuffle data is considered lost if:
+            // - If we know that the host was decommissioned
+            // - Or when `unRegisterOutputOnHostOnFetchFailure` is enabled and 
we had
+            //   a fetch failure with the external shuffle service, so we 
assume all
+            //   shuffle data on the node is bad.
+            val hostLost = isHostDecommissioned || 
(externalShuffleServiceEnabled &&

Review comment:
       I think the key issue is that the unregisterOutputOnHost is poorly named 
but I think that should be fixed in a separate PR.
   
   A better name/description for this config should be: "assume that all 
executor outputs on host are lost if any executor triggers fetch failure with 
an external shuffle service". Basically its just a "Hint" that is saying when 
to treat the executor loss as the host loss.
   
   To which, I introduce another wrinkle: Shuffle output on the host is now 
considered unconditionally lost if we know that the host is decommissioned OR 
this above hint is enabled. Note that I am not introducing any separate config 
for the first part of this disjunction. 
   
   I really don't think I am perverting the original meaning of 
unRegisterOutputOnHost. I am simply tacking an additional condition on top of 
it.
   
   I have tried to rework the code with some comments to illustrate the logic. 
But please do help me understand what would be a better description/name here. 
Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to