First, were you running a secondary namenode? Did you follow the Hadoop instructions for recovering the fsimage from the secondary's checkpoint? Is it too late for you to try that?
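If you haven't, a rough sketch of that recovery path follows; it assumes a Hadoop version whose namenode supports the -importCheckpoint option, and the directory settings mentioned in the comments are hypothetical:

  # Stop HBase and the namenode before touching DFS metadata
  # (stop-hbase.sh from the HBase install, hadoop scripts from Hadoop's).
  bin/stop-hbase.sh
  bin/hadoop-daemon.sh stop namenode
  # With dfs.name.dir empty and fs.checkpoint.dir pointing at the
  # secondary namenode's checkpoint, load the last good fsimage:
  bin/hadoop namenode -importCheckpoint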
In general, I think it would be useful for HBase to provide a recovery option where a corrupt table region can be reinitialized as empty; at least the whole table would not be lost. I have wanted something like this on occasion. It could be a new shell tool.

One thing you can do is schedule a daily maintenance window where you shut down your cluster and run a Hadoop distcp from the HBase/primary cluster to a secondary DFS cluster serving as backup media. This is akin to making a tape backup, with the same drawback: on recovery you lose all edits made since the last backup, but you do not lose everything. distcp copies the data in parallel, so the backup can complete quickly even if the tables are large. (A rough command sketch follows the quoted message below.)

   - Andy

> From: g00dn3ss <[email protected]>
> Subject: Recovering HBase after HDFS Corruption
> To: [email protected]
> Date: Wednesday, December 24, 2008, 10:40 PM
>
> Hi All,
>
> We had a hardware failure on our namenode that led to
> corruption in our DFS. I ran an fsck and moved the
> corrupted files to a lost+found directory. The DFS
> now seems to run fine by itself. However, if I run
> HBase following the fsck, I get a bunch of FileNotFound
> exceptions as it tries to access some of the files
> that were corrupted. This ultimately seems to lead to
> the HMaster getting in a bad state where it doesn't
> respond.
>
> So I'm wondering if there is a way to recover from my
> current state. [...]
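P.S. A rough sketch of the distcp backup described above; the cluster addresses and paths are hypothetical, and both clusters are assumed to run compatible Hadoop versions:

  # Quiesce HBase so the files under the HBase root are consistent.
  bin/stop-hbase.sh
  # Copy the HBase root directory to the backup DFS cluster in parallel.
  bin/hadoop distcp hdfs://primary:9000/hbase \
      hdfs://backup:9000/backups/hbase-2008-12-24
  # Restart HBase once the copy completes.
  bin/start-hbase.sh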
