agrawaldevesh edited a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643552975
@holdenk, what are your thoughts on reducing the number of fetch failed exceptions (and thus potential job failures) due to decommissioning: - This PR tries to offload the shuffle blocks to another peer but that is somewhat best effort. - When the executor is eventually lost (detected much later by a heartbeat timeout), its shuffle files will be cleared. Until that happens the unmigrated blocks (for whatever reason) will result in fetch failures. Block migration may not be possible in all cases since we may not have enough time to do the migration or the peer's disk might be full or it might soon after be decommissioned itself. This sort of "bulk decommissioning" can happen for example during a cluster scale down where many nodes are brought down in quick succession. While block migration is great in that it avoids wasting work, we cannot guarantee it and I think we need a second line of defense by proactively calling `MapOutputTracker.removeOutputsOnExecutor` (and clearing `DAGScheduler.cacheLocs`) perhaps after some timeout. (I think FetchFailures may still happen if a reducer has already started.) Another way to solve this problem is for the executor to cleanly exit and tell the driver that it is going away, so that the driver can forget about its (unmigrated) shuffle blocks. But even this is somewhat best effort since the cloud provider may not give it time to do a clean shutdown. It would be nice to prevent/reduce job failures caused by fetch failures: Particularly because we already know that the executor would be yanked soon, so we should be able to do better. All these three approaches to preventing fetch failures are orthogonal and we probably should allow users to enable each one of them independently: * Offloading shuffle blocks * Clearing shuffle statuses * Eager and clean termination of the executor. Thanks. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
