mridulm commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1553434747

> Case 2 can't be handled by existing config, there'll be other similar recoverable cases

Did you try increasing the idle timeout? The behavior is specific to the environment the application is running in - executors are unable to respond to shuffle requests for more than 2 minutes, which is a tuning or deployment issue.

> Generally speaking, I think unregister all map output when fetch failed is too aggressive.

As described, this is a case of not configuring Spark appropriately for the load/cluster characteristics. For example, in our internal environment, the network timeout is set to a significantly higher value than the default 120s due to a variety of factors - the default 2 minutes would result in failures, including the specific shuffle issue mentioned here.

The proposed change would complicate how we reason about when shuffle data is lost, and I am hesitant about it when the problem can be mitigated with appropriate tuning.
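
For reference, a minimal sketch of the kind of tuning I mean, raising the relevant timeouts and retry settings above their defaults at session creation time. The specific values here are illustrative assumptions, not recommendations; they need to be tuned to the cluster:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: give loaded executors more headroom to answer shuffle fetch
// requests before a fetch failure is declared. Values are illustrative.
val spark = SparkSession.builder()
  .appName("shuffle-timeout-tuning")
  .config("spark.network.timeout", "600s")      // default is 120s
  .config("spark.shuffle.io.maxRetries", "10")  // default is 3
  .config("spark.shuffle.io.retryWait", "30s")  // default is 5s
  .getOrCreate()
```

With settings along these lines, a temporarily unresponsive executor is retried over a longer window instead of triggering the fetch-failure path in the first place.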
> Case 2 can't be handled by existing config, there'll be other similar recoverable cases Did you try increasing idle timeout ? The behavior is specific to the environment application is running in - where executors are unable to respond to shuffle requests for more than 2 minutes: this is a tuning or deployment issue. > Generally speaking, I think unregister all map output when fetch failed is too aggressive. As described, this is a case of not appropriately configuring spark for the load/cluster characteristics. For example, in our internal env, the network timeout is set to a significantly higher value than the default 120s due to a variety of factors - the default 2mins would result in failures (including this specific shuffle issue mentioned). This proposed change would complicate the way we reason about when shuffle data is lost - and I am hesitant about this if it is something that can be mitigated with appropriate tuning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org