Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17543
  
    In theory (as you may know), this is how it's supposed to work: since 
each reduce task reads the map outputs in random order, we delay re-scheduling 
the earlier stage to try to collect as many fetch failures as possible 
(so you don't need one stage failure for each failed map task).
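
    (To illustrate that batching idea, here is a rough, self-contained sketch.
It is not Spark's actual DAGScheduler code, and every name in it is made up;
the only point is that the first fetch failure for a map stage opens a short
window, and later failures that arrive inside that window are folded into a
single resubmission of that stage.)

    import scala.collection.mutable

    object FetchFailureBatchingSketch {

      case class FetchFailure(mapStageId: Int, mapTaskId: Int)

      // Map stage id -> ids of map tasks whose output needs to be regenerated.
      private val lostOutputs = mutable.Map.empty[Int, mutable.Set[Int]]
      // Stages that already have a delayed resubmission scheduled.
      private val pendingResubmit = mutable.Set.empty[Int]

      // Called when a reduce task reports that it couldn't fetch a map output.
      def handleFetchFailure(f: FetchFailure): Unit = synchronized {
        lostOutputs.getOrElseUpdate(f.mapStageId, mutable.Set.empty) += f.mapTaskId
        // Only one delayed resubmission per map stage; failures that arrive
        // while the window is open just add to lostOutputs and ride along.
        if (pendingResubmit.add(f.mapStageId)) {
          scheduleAfterDelay(f.mapStageId)
        }
      }

      // Stand-in for a timer or event-loop based delay (a thread keeps the sketch short).
      private def scheduleAfterDelay(stageId: Int): Unit = {
        new Thread(() => {
          Thread.sleep(200) // batching window; the real value is a scheduler detail
          val lost = synchronized {
            pendingResubmit -= stageId
            lostOutputs.remove(stageId).getOrElse(mutable.Set.empty)
          }
          println(s"resubmitting map stage $stageId for ${lost.size} lost map task(s): $lost")
        }).start()
      }

      def main(args: Array[String]): Unit = {
        // Three reduce tasks hit fetch failures on different map outputs of
        // stage 0; they are collected and the stage is resubmitted once.
        handleFetchFailure(FetchFailure(0, 3))
        handleFetchFailure(FetchFailure(0, 7))
        handleFetchFailure(FetchFailure(0, 11))
        Thread.sleep(500)
      }
    }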
    
    But I agree that in general things don't work well when there are lots of 
fetch failures, which is what https://issues.apache.org/jira/browse/SPARK-20178 
is tracking.  I'm not yet convinced that this is the most important fix.

