Github user tgravescs commented on the pull request:
https://github.com/apache/spark/pull/12735#issuecomment-215431514
If the disk is bad or missing, there is nothing else you can do other than create
a new db since, as you say, deleting wouldn't work.
Note I think all it does is log a message because we didn't want it to kill
the entire nodemanager, but I think we should change that. We should throw an
exception if registration doesn't work because the way the nodemanager
currently works is that it doesn't fail until someone goes to fetch the data.
If it failed fast when the executor registered with it, that would be
preferable, but that is a YARN bug.
If you are going to look at the getRecoveryPath api, I think we can do it
without reflection by defining our own setRecoveryPath function in
YarnShuffleService (leave override off so it works with older versions of
hadoop). Have it default to some invalid value, and if it's Hadoop 2.5 or
greater it will get called and set to a real value. Then in our code we
could check whether it's set and, if it is, use it; if not, we could fall back to
the current implementation. Note that setRecoveryPath is the only one we really
need to define since getRecoveryPath is protected, but to be safe we might also
implement that. We can store our own path.
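
A rough sketch of the setRecoveryPath idea, just to make it concrete. Everything here is a stand-in: the class name ShuffleServiceSketch and the resolveDbDir helper are hypothetical, and a plain String is used in place of Hadoop's Path so the snippet compiles without Hadoop on the classpath. In the real YarnShuffleService the method would take the Hadoop Path type and the class would extend AuxiliaryService.

```java
// Hypothetical stand-in for YarnShuffleService; not the actual Spark class.
class ShuffleServiceSketch {
    // Stays null (the "invalid value") unless the NodeManager calls setRecoveryPath.
    private String recoveryPath = null;

    // Deliberately no @Override, so this still compiles against Hadoop versions
    // whose AuxiliaryService does not declare setRecoveryPath.
    public void setRecoveryPath(String recoveryPath) {
        this.recoveryPath = recoveryPath;
    }

    // Implemented as well "to be safe"; it just returns the path we stored ourselves.
    protected String getRecoveryPath() {
        return recoveryPath;
    }

    // Use the recovery path if it was set, otherwise fall back to the
    // current behaviour of picking a local dir (passed in here by the caller).
    String resolveDbDir(String legacyLocalDir) {
        return recoveryPath != null ? recoveryPath : legacyLocalDir;
    }
}
```

On an older Hadoop nothing ever calls setRecoveryPath, so resolveDbDir returns the legacy location; on a newer one the NodeManager-provided path wins.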
The only other thing here is that we may want to handle upgrading. If you
are currently running the shuffle service the ldb will be in local dirs but
when you upgrade it will go to a new path and wouldn't find the old one. To
handle this we could just look for it in the new path first and, if it's not there,
look for it in the old locations and, if found, move it to the new location.
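
The upgrade lookup could look roughly like the following. This is only a sketch under assumptions: RecoveryDbLocator, locate, and the dbName parameter are hypothetical names, and it uses java.nio.file rather than whatever file handling the shuffle service actually uses. Note that Files.move on a non-empty directory only works as a simple rename when source and target are on the same file system; a cross-disk move would need a recursive copy instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper; not part of the actual shuffle service.
class RecoveryDbLocator {
    // Returns the path under recoveryDir where the db should live, moving an
    // existing db from one of the old local dirs there if one is found.
    static Path locate(Path recoveryDir, List<Path> legacyLocalDirs, String dbName)
            throws IOException {
        Path target = recoveryDir.resolve(dbName);
        // 1. Prefer a db that is already in the new location.
        if (Files.exists(target)) {
            return target;
        }
        // 2. Otherwise look through the old locations and move the first match.
        for (Path dir : legacyLocalDirs) {
            Path old = dir.resolve(dbName);
            if (Files.exists(old)) {
                Files.createDirectories(recoveryDir);
                Files.move(old, target); // rename; assumes same file system
                return target;
            }
        }
        // 3. Nothing found anywhere: the caller creates a fresh db at target.
        return target;
    }
}
```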
I think between the throw and using the new api we shouldn't need to check
the disks. The recovery path that is supposed to be set by administrators is
supposed to be something very resilient, such that if it's down nothing on the
node would work.