> HBase reads all the files in an HStore on open of a region. It keeps the
> list of files in memory and doesn't expect it to change over a HStore's
> lifecycle.
>
> If for any reason -- hdfs hiccup -- the list that hbase has in memory
> strays from what's on the filesystem because the filesystem has lost a file
> or lost a block in a file, then hbase will have trouble with the damaged
> Store. You'll see the issue on an access, either a fetch or when hbase goes
> to compact the damaged Store. Currently, to get hbase to reread the
> filesystem, the region needs to be redeployed. This means a restart of the
> hosting RegionServer (on the particular host run './bin/hbase regionserver
> stop' and then start). A whole regionserver restart can be disruptive,
> especially if the cluster is small and the regionserver is carrying lots of
> regions. A new tool was added to the shell in TRUNK that allows you to close
> an individual region (in the shell, type 'tools'). On region close it'll be
> redeployed elsewhere by the master. On reopen, hbase will reread the
> filesystem content.
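The quoted restart-vs-close workflow can be sketched as shell commands. This is a hedged sketch, not a verbatim procedure: the region name is a hypothetical placeholder you would substitute from a scan of .META., and `close_region` is the TRUNK shell tool the quote refers to.

```shell
# Old, disruptive approach: bounce the whole regionserver hosting the
# damaged Store (run on that particular host), then start it again.
./bin/hbase regionserver stop
./bin/hbase regionserver start

# TRUNK shell tool: close just the one damaged region so the master
# redeploys it elsewhere and the filesystem is reread on reopen.
# 'mytable,startrow,1234567890' is a made-up region name for illustration;
# type 'tools' in the shell to see the command's help.
echo "close_region 'mytable,startrow,1234567890'" | ./bin/hbase shell
```

Closing the single region avoids moving every other region the server is carrying, which matters on a small cluster.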
OK, I did some more experimenting. Just to be clear, there are no more corrupt files in my HDFS: I ran an fsck and moved all the corrupt files to lost+found. I then looked in my lost+found folder to see which files HDFS thought were corrupt. All of them were from subfolders under /hbase/<my-table-name>; I wanted to be sure it wasn't the -ROOT- or .META. regions that were corrupted.

So the thing that doesn't make sense to me in your statement above is this: I stop the entire HBase instance, including the master, and then restart the system from scratch. Some of the regionservers still get "FileNotFound" exceptions when looking for some of the corrupted files, and the affected regionservers then shut down. I don't understand how the problem I'm seeing could be caused by something in memory that doesn't match what's on disk if I'm starting the entire system cold.

The other issue that causes further problems in this case is when one of these problematic regions lands on the same regionserver as the -ROOT- region. When the regionserver holding -ROOT- crashes, the entire system seems to go down. Is this what http://issues.apache.org/jira/browse/HBASE-1080 is about?
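For reference, the fsck-and-quarantine steps described above look roughly like the following. The paths are assumptions based on a default layout (`/hbase` as the HBase root, `/lost+found` as fsck's move target); adjust for your cluster.

```shell
# Report on the health of everything under the HBase root directory.
./bin/hadoop fsck /hbase

# Move files with missing/corrupt blocks into /lost+found so HBase
# no longer trips over them (this is what produced the lost+found
# contents inspected above).
./bin/hadoop fsck /hbase -move

# List what was quarantined to see which table's regions were hit,
# e.g. entries under /hbase/<my-table-name> rather than -ROOT- or .META.
./bin/hadoop fs -lsr /lost+found
```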
