[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

tgravescs Thu, 28 Apr 2016 06:57:18 -0700

Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215431514
  
    If the disk is bad or missing, there is nothing else you can do other than 
create a new db since, as you say, deleting wouldn't work. 
    
    Note that I think all it does today is log a message, because we didn't 
want it to kill the entire nodemanager, but I think we should change that. We 
should throw an exception if registration doesn't work, because the way the 
nodemanager currently works is that it doesn't fail until someone goes to fetch 
the data. If it failed quickly when the executor registered with it, that would 
be preferable, but that is a YARN bug.
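
    To illustrate the log-versus-throw point, here is a minimal sketch of 
fail-fast registration. Everything in it is hypothetical (RegistrationSketch, 
StateStore, registerExecutor as written here); it is not the actual shuffle 
service code, just the shape of the change I have in mind.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the real shuffle service classes.
class RegistrationSketch {
    // Assumed abstraction over the on-disk db that stores registrations.
    interface StateStore {
        void put(String key, String value) throws IOException;
    }

    private final StateStore store;
    private final Map<String, String> executors = new ConcurrentHashMap<>();

    RegistrationSketch(StateStore store) {
        this.store = store;
    }

    void registerExecutor(String execId, String shuffleInfo) {
        try {
            store.put(execId, shuffleInfo);
        } catch (IOException e) {
            // Fail fast: surface the error to the registering executor
            // instead of only logging it and failing later at fetch time.
            throw new IllegalStateException(
                "Failed to persist registration for " + execId, e);
        }
        // Only record the executor once the write has succeeded.
        executors.put(execId, shuffleInfo);
    }

    boolean isRegistered(String execId) {
        return executors.containsKey(execId);
    }
}
```

With this shape, a bad disk shows up as a failed registration rather than as a 
failed fetch much later.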
    
    If you are going to look at the getRecoveryPath api, I think we can do it 
without reflection by defining our own setRecoveryPath function in 
YarnShuffleService (leave @Override off so it compiles against older versions 
of hadoop).  Have the path default to some invalid value; if it's hadoop 2.5 or 
greater, the function will get called and set it to a real value.  Then in our 
code we can check whether it is set and, if it is, use it; if not, we fall back 
to the current implementation.  Note that setRecoveryPath is the only one we 
really need to define since getRecoveryPath is protected, but to be safe we 
might also implement that. We can store our own path.
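
    A rough sketch of that idea follows. It is a stand-alone stand-in, not the 
real class: ShuffleServiceSketch and resolveDbDir are hypothetical names, 
java.nio.file.Path is used in place of Hadoop's Path so the snippet compiles on 
its own, and null plays the role of the "invalid value".

```java
import java.nio.file.Path;

// Hypothetical stand-in for YarnShuffleService. The real class extends
// Hadoop's AuxiliaryService and would use org.apache.hadoop.fs.Path.
class ShuffleServiceSketch {
    // null means "never set", i.e. the framework did not call setRecoveryPath.
    private Path recoveryPath = null;

    // Deliberately no @Override: on a Hadoop version whose AuxiliaryService
    // lacks this method it is just an ordinary method that nobody calls.
    public void setRecoveryPath(Path recoveryPath) {
        this.recoveryPath = recoveryPath;
    }

    // Also defined locally so we read back our own stored value.
    protected Path getRecoveryPath() {
        return recoveryPath;
    }

    // Use the recovery path when it was set, else fall back to the
    // existing behaviour of picking a local dir.
    Path resolveDbDir(Path legacyLocalDir) {
        Path p = getRecoveryPath();
        return p != null ? p : legacyLocalDir;
    }
}
```

The point of storing the value ourselves is that no reflection is needed to 
find out whether the nodemanager supplied a recovery path.
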
    The only other thing here is that we may want to handle upgrading. If you 
are currently running the shuffle service, the ldb will be in the local dirs, 
but after you upgrade it will go to a new path and the service won't find the 
old one. To handle this we could look for it in the new path first and, if it 
is not there, look for it in the old locations and, if found, move it to the 
new location.
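
    The upgrade lookup could be sketched roughly as below. RecoveryDbLocator, 
locateDb and the dbName parameter are hypothetical names for illustration, and 
the snippet uses java.nio.file rather than the Hadoop filesystem classes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the existing shuffle service.
class RecoveryDbLocator {
    // Returns where the db should live. If an old copy exists in one of the
    // local dirs and nothing is at the new location yet, it is moved over.
    static Path locateDb(Path recoveryDir, List<Path> oldLocalDirs, String dbName)
            throws IOException {
        Path target = recoveryDir.resolve(dbName);
        if (Files.exists(target)) {
            return target; // already at the new path
        }
        for (Path dir : oldLocalDirs) {
            Path old = dir.resolve(dbName);
            if (Files.exists(old)) {
                Files.createDirectories(recoveryDir);
                // Files.move renames a directory when source and target are
                // on the same file store. Moving a non-empty directory across
                // file stores can fail with an IOException, so a real
                // implementation may need a recursive copy instead.
                Files.move(old, target);
                return target;
            }
        }
        // Nothing found anywhere: the caller creates a fresh db here.
        return target;
    }
}
```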
    
    I think between the throw and using the new api we shouldn't need to check 
the disks.   The recovery path that is set by administrators is supposed to be 
something very resilient, such that if it's down nothing on the node would 
work.



