Thanks for the help. I am running a secondary name node. I didn't restore from it initially because I was able to get things working from the primary name node. I had the (probably mistaken) impression that restoring from the secondary was a last resort, to be used only when the primary couldn't be recovered.

Until HBase started failing, I hadn't even considered that HBase would have problems with my removing the corrupt files. By that point, I thought it was probably too late to try the secondary, since it was probably already reflecting my fsck changes on the primary.

I guess I will try recovering from the secondary, since it sounds like I will otherwise lose the whole table anyway. Alternatively, is there a way that I can manually initialize the corrupt table regions to empty?

Thanks again!

On Thu, Dec 25, 2008 at 4:26 AM, Andrew Purtell <[email protected]> wrote:
> First, were you running a secondary data node? Did you
> follow the Hadoop instructions for recovering a fs image
> from the secondary? Is it too late for you to try it?
>
> In general, I think it may be useful for HBase to provide
> a recovery option where a corrupt table region can be
> reinitialized as empty. At least the whole table will not
> be lost. I have wanted something like this on occasion.
> This could be a new shell tool.
>
> One thing you can do is schedule daily maintenance time
> where you shut down your cluster and do a Hadoop distcp
> from the HBase/primary cluster to a secondary DFS cluster
> serving as backup media. This is akin to making a tape
> backup and has the same drawback of losing all edits
> subsequent to the last backup upon recovery, but on the
> other hand you do not lose everything. The distcp copies
> the data in reasonable parallel fashion so the backup
> can complete quickly even if the tables are large.
>
> - Andy
>
> > From: g00dn3ss <[email protected]>
> > Subject: Recovering HBase after HDFS Corruption
> > To: [email protected]
> > Date: Wednesday, December 24, 2008, 10:40 PM
> > Hi All,
> >
> > We had a hardware failure on our namenode that led to
> > corruption in our DFS. I ran an fsck and moved the
> > corrupted files to a lost+found directory. The DFS
> > now seems to run fine by itself. However, if I run
> > HBase following the fsck, I get a bunch of FileNotFound
> > exceptions as it tries to access some of the files
> > that were corrupted. This ultimately seems to lead to
> > the HMaster getting in a bad state where it doesn't
> > respond.
> >
> > So I'm wondering if there is a way to recover from my
> > current state.
> [...]
