I am running Hadoop 0.18.2 and HBase 0.18.1.
On Mon, Dec 29, 2008 at 1:37 PM, stack <[email protected]> wrote:
> Agreed. Something else is going on. Can we see logs?
>
I get a never-ending stream of log messages on the master about assigning
regions to regionservers. The messages look like this:
2008-12-23 13:52:40,451 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region
1212033026129,com.seekingalpha/article/105786-risks-remain-but-iphone-s-fundamentals-should-help-apple-surpass-expectations-rbc-analyst?source=feed,1228437115879
to server ...
...
2008-12-23 13:52:38,073 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_PROCESS_OPEN:
1212033026129,com.seekingalpha/article/105786-risks-remain-but-iphone-s-fundamentals-should-help-apple-surpass-expectations-rbc-analyst?source=feed,1228437115879
from...
...
2008-12-23 13:52:38,074 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_CLOSE:
1212033026129,com.seekingalpha/article/105786-risks-remain-but-iphone-s-fundamentals-should-help-apple-surpass-expectations-rbc-analyst?source=feed,1228437115879:
[...@576504fa from...
On the region servers, it apparently tries to open the various regions and
gets a FileNotFoundException like the one below.
2008-12-23 10:18:23,506 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
1212033026129,com.seekingalpha/article/105786-risks-remain-but-iphone-s-fundamentals-should-help-apple-surpass-expectations-rbc-analyst?source=feed,1228437115879
2008-12-23 10:18:23,622 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: error opening region
1212033026129,com.seekingalpha/article/105786-risks-remain-but-iphone-s-fundamentals-should-help-apple-surpass-expectations-rbc-analyst?source=feed,1228437115879
java.io.FileNotFoundException: File does not exist:
hdfs://...1212033026129/1132874927/contents/mapfiles/8447584263254958332/data
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:394)
        at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:447)
        at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:447)
        at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:226)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1728)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:469)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:911)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:883)
        at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:823)
        at java.lang.Thread.run(Thread.java:595)
I guess it's missing some important file that I deleted when doing my fsck.
I saw previously that HBase doesn't like zero-length files, in this thread:
http://markmail.org/message/rqpx5egahd6uovmn#query:hbase%20hlog%20files+page:2+mid:iqmtqny5k75hmok3+state:results
I guess HBase also has problems if either the data or the index file of a
MapFile is missing? If that is the case, I don't think I can patch my table
up by hand: I looked in the lost+found directory and counted around 1900
files deleted by fsck. So I guess I'd have to write some kind of tool to
patch things up (rough sketch below).
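Something along these lines, maybe. This is untested and only a sketch: the
on-disk layout it assumes (region dir / column family / mapfiles / <id> /
{data, index}) is my guess from the paths in the logs above, and the
MapFileChecker name and wiring are just illustrative.

// Rough sketch, not tested: walk a table directory and report MapFile
// directories that are missing their data or index file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MapFileChecker {
  public static void main(String[] args) throws Exception {
    // args[0] = table directory under the hbase root, e.g. /hbase/mytable
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus region : fs.listStatus(new Path(args[0]))) {
      if (!region.isDir()) continue;                    // region directories
      for (FileStatus family : fs.listStatus(region.getPath())) {
        if (!family.isDir()) continue;                  // column family directories
        Path mapfiles = new Path(family.getPath(), "mapfiles");
        if (!fs.exists(mapfiles)) continue;
        for (FileStatus mf : fs.listStatus(mapfiles)) { // one subdir per store file
          boolean hasData = fs.exists(new Path(mf.getPath(), "data"));
          boolean hasIndex = fs.exists(new Path(mf.getPath(), "index"));
          if (!hasData || !hasIndex) {
            System.out.println("incomplete mapfile " + mf.getPath()
                + " (data=" + hasData + ", index=" + hasIndex + ")");
          }
        }
      }
    }
  }
}

That would at least tell me how many store files are affected before I try
anything more invasive.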
> System should recover when region hosting -ROOT- goes down. Which version
> of hbase (pardon me if you've already said which version)?
>
I see in the follow-up message that this was addressed by HBASE-927 for
HBase 0.18.1.
If I am understanding the problem correctly now, I have a more general
question about the HBase architecture. It seems like HBase deletes and
rewrites large portions of the table's data (during compactions, I assume).
That seems to introduce a reliability concern that multiplies any concerns
about the reliability of the DFS itself. Maybe this is a known issue, and
that's why you're suggesting that I back up my whole table. Just curious
whether there are any plans to address reliability for failure cases like
this. HBASE-50 seems to be all about taking snapshots, which isn't really
practical for us at the moment.
Thanks!