Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@jerryshao are you actually seeing issues with this on real
customer/production jobs? How often? NM failure for us is very rare. I'm not
familiar with how Mesos would fail differently; the shuffle service there is
started as a separate service, correct?
We would definitely need to make sure that Spark's retries before actually
returning a fetch failure are good enough to handle cases like rolling
upgrades or intermittent shuffle issues, but with our defaults of 3 retries
at 5 seconds each, I'm not sure that would cover it.
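For reference, the defaults in question map to these Spark properties (a sketch; values shown are the stock defaults, which give roughly 3 × 5s ≈ 15 seconds of retrying before a FetchFailed is surfaced — likely shorter than a rolling NM restart):

```properties
# Number of times a shuffle fetch is retried on IO failure (default: 3)
spark.shuffle.io.maxRetries=3

# Wait between consecutive shuffle fetch retries (default: 5s)
spark.shuffle.io.retryWait=5s
```

Raising either value stretches the window the fetcher will tolerate a temporarily unavailable shuffle service, at the cost of slower failure detection for genuinely dead nodes.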
If people are seeing a need for this and hitting actual issues, then I'm ok
with the idea as long as we make it configurable so it can be turned off. I'm
not sure about blacklisting after a single fetch failure either. It would seem
better to blacklist only after a couple of tasks have hit the failure; that
way you would have better confidence it was really an issue with the node you
are fetching from.
cc @squito