g00dn3ss wrote:
....Until HBase started failing, I
didn't even consider that HBase would have problems with me removing corrupt
files.
HBase reads all the files in an HStore when a region opens. It keeps
the list of files in memory and doesn't expect it to change over an
HStore's lifecycle.
If for any reason -- an hdfs hiccup, say -- the list that hbase has in
memory strays from what's on the filesystem because the filesystem has
lost a file or lost a block in a file, then hbase will have trouble with
the damaged Store. You'll see the issue on an access, either a fetch or
when hbase goes to compact the damaged Store. Currently, to get hbase
to reread the filesystem, the region needs to be redeployed. This means
a restart of the hosting RegionServer (on the particular host, run
'./bin/hbase regionserver stop' and then start it again). A whole
regionserver restart can be disruptive, especially if the cluster is
small and the regionserver is carrying lots of regions. A new tool was
added to the shell in TRUNK that allows you to close an individual
region (in the shell, type 'tools' to list it). On region close, it'll
be redeployed elsewhere by the master. On reopen, hbase will reread the
filesystem content.
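Roughly, the two paths look like this (the region name below is made
up, and the close command's exact name and arguments may differ in your
version -- 'tools' in the shell lists what's there):

  # Option 1: bounce the whole regionserver on the affected host
  ./bin/hbase regionserver stop
  ./bin/hbase regionserver start

  # Option 2 (TRUNK only): close just the damaged region from the shell;
  # the master redeploys it and the filesystem is reread on reopen
  ./bin/hbase shell
  hbase> tools
  hbase> close_region 'mytable,,1230000000000'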
Alternately, is
there a way that I can manually initialize the corrupt table regions to
empty?
In the shell there is a truncate for the whole table.
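For example (table name is made up):

  hbase> truncate 'mytable'

Truncate disables, drops, and recreates the table with the same schema,
so you end up with the table present but empty.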
Otherwise, if you want to operate at the region level only, first try
the trick above of redeploying the region. If that doesn't work, read
the log for the problematic file. Is it present? If so, and you just
want to get going again, remove the bad file and then redeploy the
region again. Use './bin/hadoop fsck /HBASE_ROOT' to help you figure
out which files are problematic. Cycle till you've cleared all
corruption.
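Something like the following, for instance (paths are made up; your
root is whatever hbase.rootdir points at, /hbase by default):

  # find corrupt or missing blocks under the hbase root
  ./bin/hadoop fsck /hbase
  # remove a bad file fsck identified, then redeploy its region
  ./bin/hadoop fs -rm /hbase/mytable/1234567890/family/4567890123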
On Thu, Dec 25, 2008 at 4:26 AM, Andrew Purtell <[email protected]> wrote:
In general, I think it may be useful for HBase to provide
a recovery option where a corrupt table region can be
reinitialized as empty. At least the whole table will not
be lost. I have wanted something like this on occasion.
This could be a new shell tool.
Maybe one at the store level too? So, clean_store and clean_region?
One thing you can do is schedule daily maintenance time
where you shut down your cluster and do a Hadoop distcp
from the HBase/primary cluster to a secondary DFS cluster
serving as backup media. This is akin to making a tape
backup and has the same drawback of losing all edits
subsequent to the last backup upon recovery, but on the
other hand you do not lose everything. The distcp copies
the data in reasonable parallel fashion so the backup
can complete quickly even if the tables are large.
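For example, with hbase shut down, a distcp of the hbase root to a
second cluster might look like this (namenode hosts, ports, and the
backup path are placeholders):

  ./bin/hadoop distcp hdfs://primary-nn:9000/hbase \
      hdfs://backup-nn:9000/hbase-backup-20081225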
It would be sweet, though difficult, if you didn't have to quiesce hbase
while making a backup. See https://issues.apache.org/jira/browse/HBASE-50
for some kicking around of ideas.
St.Ack