Ngone51 commented on pull request #29579: URL: https://github.com/apache/spark/pull/29579#issuecomment-683534969
Thank you for the quick response @agrawaldevesh.

> However, with this PR, it seems you are removing the "clear shuffle on fetch failure" part. It seems that you will wait for the heartbeat failure to occur and the host be lost, even if the downstream has signaled fetch failure.

I think this PR doesn't change the semantics. We still clear shuffle status on fetch failure. As you can see, the only change for fetch failure in DAGScheduler is:

```java
-  .exists(_.isHostDecommissioned)
+  .exists(_.hostOpt.isDefined)
```

If the fetch failure comes first, before the executor lost event, DAGScheduler will still ask TaskSchedulerImpl for the decommission state and unregister the shuffle status then. If the executor lost event comes first, the fetch failure becomes a no-op for the shuffle status unregistration.

I think the only difference is that, before this PR, if the executor lost event came first, it could only unregister the shuffle map status on that executor, even if we knew the host was also decommissioned. Now we can unregister the shuffle status for the whole host because we pass in the host info directly.
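
To make the ordering argument concrete, here is a rough toy model of the behaviour described above. It is not Spark code: `ShuffleUnregisterSketch`, `register`, `unregister`, and the `host/executorId` key format are all hypothetical names made up for illustration. The second event is a no-op here simply because nothing is left to remove; how the scheduler itself skips duplicate events is not modelled.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Toy model, not Spark code: every name here is hypothetical.
class ShuffleUnregisterSketch {
    // Registered map outputs, each keyed as "host/executorId".
    private final Set<String> outputs = new HashSet<>();

    void register(String host, String execId) {
        outputs.add(host + "/" + execId);
    }

    // Called for either event (fetch failure or executor lost).
    // If hostOpt is defined, drop every output on that host;
    // otherwise drop only the outputs of the given executor.
    // Returns the number of entries removed.
    int unregister(String execId, Optional<String> hostOpt) {
        int before = outputs.size();
        if (hostOpt.isPresent()) {
            String prefix = hostOpt.get() + "/";
            outputs.removeIf(o -> o.startsWith(prefix));
        } else {
            String suffix = "/" + execId;
            outputs.removeIf(o -> o.endsWith(suffix));
        }
        return before - outputs.size();
    }

    int remaining() {
        return outputs.size();
    }

    // Two executors on hostA (the host being decommissioned), one on hostB.
    static ShuffleUnregisterSketch sample() {
        ShuffleUnregisterSketch s = new ShuffleUnregisterSketch();
        s.register("hostA", "exec1");
        s.register("hostA", "exec2");
        s.register("hostB", "exec3");
        return s;
    }

    public static void main(String[] args) {
        // Host info passed along: the first event clears the whole host,
        // the second event finds nothing left to remove.
        ShuffleUnregisterSketch withHost = sample();
        System.out.println(withHost.unregister("exec1", Optional.of("hostA")));
        System.out.println(withHost.unregister("exec1", Optional.of("hostA")));

        // No host info: only exec1's own output is removed.
        ShuffleUnregisterSketch withoutHost = sample();
        System.out.println(withoutHost.unregister("exec1", Optional.empty()));
    }
}
```

In this sketch it does not matter which of the two events arrives first when the host is known: whichever one runs first clears both `hostA` entries. Without the host info, only the single `exec1` entry goes away, which corresponds to the pre-PR executor-lost case.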
