mridulm commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1553434747

   > Case 2 can't be handled by existing config, there'll be other similar 
recoverable cases
   
   Did you try increasing the idle timeout?
   The behavior is specific to the environment the application is running in, where executors are unable to respond to shuffle requests for more than 2 minutes: this is a tuning or deployment issue.
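   
   For illustration, a minimal sketch of the kind of tuning being suggested, assuming the relevant knobs are the shuffle I/O connection timeout and retry settings (the values below are assumptions for the sketch, not recommendations):
   
   ```scala
   import org.apache.spark.SparkConf
   
   // Illustrative tuning sketch: raise how long a fetch tolerates an idle or
   // unresponsive executor before the failure surfaces, and retry a bit longer.
   val conf = new SparkConf()
     .set("spark.shuffle.io.connectionTimeout", "600s") // falls back to spark.network.timeout when unset
     .set("spark.shuffle.io.maxRetries", "6")           // default is 3
     .set("spark.shuffle.io.retryWait", "30s")          // default is 5s
   ```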
   
   > Generally speaking, I think unregister all map output when fetch failed is 
too aggressive.
   
   As described, this is a case of not appropriately configuring Spark for the load/cluster characteristics.
   For example, in our internal environment, the network timeout is set to a significantly higher value than the default 120s due to a variety of factors; with the default 2 minutes we would see failures (including the specific shuffle issue mentioned here).
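   
   As a concrete sketch of that kind of override (the 600s value is an assumption for illustration, not our actual setting), it can be applied when building the session:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Raise the global network timeout above the 120s default; several narrower
   // timeouts (including shuffle I/O) fall back to this value when unset.
   val spark = SparkSession.builder()
     .appName("shuffle-timeout-tuning-sketch")
     .config("spark.network.timeout", "600s")
     .getOrCreate()
   ```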
   
   This proposed change would complicate the way we reason about when shuffle data is lost, and I am hesitant to accept it if the problem can be mitigated with appropriate tuning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

