> HBase reads all the files in an HStore on open of a region. It keeps the
> list of files in memory and doesn't expect it to change over a HStore's
> lifecycle.
>
> If for any reason -- hdfs hiccup -- the list that hbase has in memory
> strays from what's on the filesystem because the filesystem has lost a file
> or lost a block in a file, then hbase will have trouble with the damaged
> Store. You'll see the issue on an access, either a fetch or when hbase goes
> to compact the damaged Store. Currently, to get hbase to reread the
> filesystem, the region needs to be redeployed. This means a restart of the
> hosting RegionServer (on the particular host run './bin/hbase regionserver
> stop' and then start). A whole regionserver restart can be disruptive,
> especially if the cluster is small and the regionserver is carrying lots of
> regions. A new tool was added to the shell in TRUNK that allows you to close
> an individual region (in the shell, type 'tools'). On region close it'll be
> redeployed elsewhere by the master. On reopen, hbase will reread the
> filesystem content.
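The quoted restart-vs-close workflow can be sketched as shell commands. This is a hedged sketch, not a verbatim procedure: the region name is a hypothetical placeholder you would substitute from a scan of .META., and `close_region` is the TRUNK shell tool the quote refers to.

```shell
# Old, disruptive approach: bounce the whole regionserver hosting the
# damaged Store (run on that particular host), then start it again.
./bin/hbase regionserver stop
./bin/hbase regionserver start

# TRUNK shell tool: close just the one damaged region so the master
# redeploys it elsewhere and the filesystem is reread on reopen.
# 'mytable,startrow,1234567890' is a made-up region name for illustration;
# type 'tools' in the shell to see the command's help.
echo "close_region 'mytable,startrow,1234567890'" | ./bin/hbase shell
```

Closing the single region avoids moving every other region the server is carrying, which matters on a small cluster.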
OK, I did some more experimenting. Just to be clear, there are no more corrupt files in my HDFS: I ran an fsck and moved all the corrupt files to lost+found. I then looked in my lost+found folder to see which files HDFS thought were corrupt. All of them were from subfolders under /hbase/<my-table-name>; I wanted to be sure it wasn't the -ROOT- or .META. regions that were corrupted.

So the thing that doesn't make sense to me in your statement above is this: I stop the entire HBase instance, including the master, and then restart the system from scratch. Some of the regionservers still get "FileNotFound" exceptions when looking for some of the corrupted files, and the affected regionservers then shut down. I don't understand how the problem I'm seeing could be caused by something in memory that doesn't match what's on disk if I'm starting the entire system cold.

The other issue that causes further problems in this case is when one of these problematic regions lands on the same regionserver as the -ROOT- region. When the regionserver holding -ROOT- crashes, the entire system seems to go down. Is this what http://issues.apache.org/jira/browse/HBASE-1080 is about?
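For reference, the fsck-and-quarantine steps described above look roughly like the following. The paths are assumptions based on a default layout (`/hbase` as the HBase root, `/lost+found` as fsck's move target); adjust for your cluster.

```shell
# Report on the health of everything under the HBase root directory.
./bin/hadoop fsck /hbase

# Move files with missing/corrupt blocks into /lost+found so HBase
# no longer trips over them (this is what produced the lost+found
# contents inspected above).
./bin/hadoop fsck /hbase -move

# List what was quarantined to see which table's regions were hit,
# e.g. entries under /hbase/<my-table-name> rather than -ROOT- or .META.
./bin/hadoop fs -lsr /lost+found
```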
