Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/11254#issuecomment-189417207
  
    But if the scheduler behaves the way I expect, then a single node's failure will not produce consecutive fetch failures, but rather fetch failures associated with a single stageAttempt.  Subsequent stageAttempts should end up attempting to fetch the map outputs from a location other than the failed node.  If that is all working as expected, then four consecutive fetch failures would indicate either that four consecutive attempts to fetch the map outputs, each from a different location, all failed through no fault of the Stage's Tasks (very unlikely), or that the Stage's Tasks are causing problems on whichever Executor they run (more likely).
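
    To make the reasoning above concrete, here is a minimal Scala sketch (not the actual DAGScheduler code) of the counting behavior being discussed: a fetch-failed stage attempt bumps a per-stage streak, a successful attempt resets it, and only a full streak of consecutive fetch-failed attempts aborts the stage.  The names `StageState`, `recordAttemptOutcome`, and the threshold value are hypothetical, chosen only for illustration.

```scala
// Sketch only: assumes a threshold of 4, as discussed in this PR, and that
// each retried attempt would fetch map outputs from a different location.
object FetchFailureSketch {

  val MaxConsecutiveFetchFailures = 4  // assumed value, not a Spark constant

  final case class StageState(stageId: Int, consecutiveFetchFailedAttempts: Int = 0)

  sealed trait AttemptOutcome
  case object AttemptSucceeded extends AttemptOutcome
  case object AttemptFailedByFetch extends AttemptOutcome

  /** Returns the updated state plus whether the stage should be aborted. */
  def recordAttemptOutcome(state: StageState, outcome: AttemptOutcome): (StageState, Boolean) =
    outcome match {
      case AttemptSucceeded =>
        // Any successful attempt resets the streak, so only truly consecutive
        // fetch-failed attempts can ever reach the abort threshold.
        (state.copy(consecutiveFetchFailedAttempts = 0), false)
      case AttemptFailedByFetch =>
        val failures = state.consecutiveFetchFailedAttempts + 1
        (state.copy(consecutiveFetchFailedAttempts = failures),
          failures >= MaxConsecutiveFetchFailures)
    }
}
```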
    
    If all that is not working as expected, then increasing 
MAX_CONSECUTIVE_FETCH_FAILURES would seem to be compensating for 
broken/unexpected behavior instead of fixing the actual underlying problem -- 
or I need to improve my understanding and expectations.  Either way, it's 
probably going to be another week before I get a chance to spend any serious 
time on this.


