[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

tgravescs Mon, 06 Mar 2017 14:24:06 -0800

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    
    In this particular case, are your map tasks fast or slow? If they are really 
fast, rerunning everything now makes sense; if each of those took 1 hour+ to 
run, failing all of them when they don't need to be just wastes time and 
resources.  Rolling upgrades can take longer than 15 seconds to restart NMs.  
You can have intermittent issues that last > 1 minute.  If it took 1 hour to 
generate that output, I want it to retry really hard before failing all of 
those.  Users aren't going to tune each individual job unless they really have 
to, and it might vary per stage.  Really it should use cost-based analysis on 
how long those tasks ran, but that gets more complicated. 
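
    As a rough illustration of the cost-based idea, here is a minimal sketch 
in Python (not Spark code). It scales how long to keep retrying fetches by how 
long the map tasks took to run. The function names, the `factor` parameter, and 
the floor/ceiling defaults are all hypothetical; the 15 second and 10 minute 
values only echo the numbers mentioned in this thread.

```python
from typing import Sequence


def retry_budget_seconds(
    map_run_seconds: Sequence[float],
    factor: float = 0.25,      # hypothetical: fraction of map runtime worth waiting
    floor: float = 15.0,       # assumed minimum wait before giving up on the host
    ceiling: float = 600.0,    # assumed cap, roughly when YARN reports the node lost
) -> float:
    """How long to keep retrying fetches before un-registering the host's outputs."""
    if not map_run_seconds:
        return floor
    # Maps rerun in parallel, so the longest task approximates the rerun cost.
    longest = max(map_run_seconds)
    return min(ceiling, max(floor, factor * longest))


def should_unregister_all(
    map_run_seconds: Sequence[float],
    unreachable_seconds: float,
) -> bool:
    """True once the host has been unreachable longer than the retry budget."""
    return unreachable_seconds >= retry_budget_seconds(map_run_seconds)
```

    With these assumed defaults, maps that ran for a few seconds are 
invalidated after about 15 seconds of fetch failures, while maps that ran for 
an hour get retried up to the 10 minute cap before being rerun.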
    
    It is also possible that some reduce tasks have already fetched the data 
from those nodes and succeeded, so you wouldn't have to rerun all tasks on that 
host. Due to the way Spark cancels stages on fetch failure, whether the reduce 
tasks from the stage finish before the map can be rerun is very timing 
dependent.  You could end up doing a lot more work than necessary, so the 
question is whether that is OK compared to the cost of not doing this and 
allowing the application to take a bit longer.  Note that if you rerun all the 
maps and they didn't need to be rerun, you might cause the app to take longer too.
    
    How often do you see issues with node managers going down for anything 
other than transient issues? 
    We very rarely see this. Most of the issues we see are transient, and we 
have found with other application types (Tez, MR) that rerunning all is worse 
because it's normally a transient issue.  Those application schedulers are a 
bit different though, so not 100% comparable.  If the whole node goes down, 
then YARN will inform you the node is lost.  Yes, that does take like 10 
minutes though.  
    
     Are you seeing jobs fail due to this or just take longer? 
     I realize the job could fail if the node is really down and we get enough 
failures across stages.  It does seem like we should do better at this, but I'm 
not sure invalidating all the outputs on a single fetch failure makes sense 
either.



