[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

sitalkedia Tue, 07 Mar 2017 10:41:53 -0800

Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    >> Rolling upgrades can take longer than 15 seconds to restart NMs. You can 
have intermittent issues that last > 1 minute. If it took 1 hour to generate 
that output I want it to retry really hard before failing all of those. Users 
aren't going to tune each individual job unless they really have to and it 
might vary per stage. 
    
    How often are you doing rolling upgrades? I really think in those cases we 
should be tuning the shuffle fetch configurations to allow for rolling 
upgrades. Even without this change, the current model un-registers files on a 
fetch failure, so you might already be losing a lot of work, and in the worst 
case you can still lose all the files present on a host. 
    
    >> It is also possible that some reduce tasks have already fetched the data 
from those nodes and succeeded and you wouldn't have to rerun all tasks on that 
host. 
    
    I am not sure I get your point here, but this will not rerun reduce 
tasks that have already fetched data from those nodes and succeeded.
    
    We are seeing this issue very frequently, mainly because of node reboots. 
We are trying to scale Spark to clusters with thousands of machines, and the 
probability of seeing failures and reboots during long-running jobs is very 
high. 
    
    >> Are you seeing jobs fail due to this or just take longer?
    
    We are seeing both. As @squito mentioned, we only give a stage 4 chances 
to be retried, so even one node reboot can trigger 4 retries and cause the job 
to fail. Even when the job gets lucky and survives, a fetch failure makes it 
take significantly longer than expected because of the multiple stage retries. 
The way the scheduler handles retries is also not elegant right now: it does 
not allow multiple attempts of a stage to run concurrently, which is a 
separate issue I will address in another PR.
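
    To make the retry arithmetic concrete, here is a toy model, not the 
scheduler's actual code. It assumes each failed stage attempt uncovers the 
lost output of exactly one executor on the rebooted node, and that the job is 
aborted once a stage has failed 4 times. The names are hypothetical.

```python
MAX_STAGE_ATTEMPTS = 4  # the "4 chances" mentioned in the comment above


def failed_attempts(lost_executors, unregister_whole_host):
    """Count stage attempts that end in a fetch failure (toy model).

    Assumption: each failed attempt reveals one executor's missing output,
    unless everything lost on the host is un-registered at once.
    """
    failures = 0
    remaining = lost_executors
    while remaining > 0:
        failures += 1
        remaining = 0 if unregister_whole_host else remaining - 1
    return failures


def job_survives(lost_executors, unregister_whole_host,
                 max_attempts=MAX_STAGE_ATTEMPTS):
    """The job survives if some attempt within the limit can still succeed."""
    return failed_attempts(lost_executors, unregister_whole_host) < max_attempts


if __name__ == "__main__":
    for whole_host in (False, True):
        print(whole_host, failed_attempts(4, whole_host),
              job_survives(4, whole_host))
```

    In this model, a node that hosted 4 executors exhausts all 4 attempts 
when output is un-registered one executor at a time, while un-registering the 
whole host costs a single failed attempt.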



