Tim Sell wrote:
...
after this crash, I couldn't read the table 'catalogue'. I have other
tables that weren't being touched and they were still intact.
What's the exception you get, Tim, when you try to read from this table?
To give a sense of what sort of load it was under: I was sending data to
it via the Thrift server in row batches, as fast as HBase could send
"I've finished that row" back. This was from a single thread. HBase is
running on a cluster of 3 machines; my process was running on a remote
machine and sending rows via Thrift.
I'm not sure of the exact number of rows it processed before the crash,
but it was at least 6 million. So I guess that's heavy-ish load?
Not a particularly heavy load. How big was each row? Multiple
columns? What size values?
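For scale, the write pattern you describe looks roughly like the loop below
with the native Java client (a minimal sketch only, not your actual Thrift
code, and assuming the 0.2-era BatchUpdate/commit API; the table and column
names are taken from your log excerpt below, the row keys and values are
invented):

// Sketch of a single-threaded batch-write loop: one row per round trip,
// the next row is not sent until the server acks the previous commit.
// Not Tim's Thrift code; table/column names come from the log below.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class CatalogueLoader {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "catalogue");
    for (long i = 0; i < 6000000L; i++) {   // ~6 million rows, as reported
      BatchUpdate bu = new BatchUpdate("row-" + i);   // invented row key
      bu.put("album_track_join:album", ("album-" + i).getBytes());
      bu.put("album_track_join:track", ("track-" + i).getBytes());
      table.commit(bu);   // blocks until the regionserver acks the row
    }
  }
}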
I've attached my region server log if you want to check it out.
I've been going through the logs, and I've noticed a few things.
First thing, the region my app complained about finished compaction,
then tried to split my table:
2008-07-25 13:09:24,447 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region catalogue,,1216982750797 in 22sec
2008-07-25 13:09:24,448 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region catalogue,,1216982750797
2008-07-25 13:09:25,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: closed catalogue,,1216982750797
then I get my first batchUpdate error:
2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020, call batchUpdate([EMAIL PROTECTED], row => 11898254, {column => album_track_join:album, value => '...', column => album_track_join:track, value => '...'}) from 10.101.1.31:41818: error: org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
    at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
    at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
Then, just after this, the region split thread finishes, apparently
successfully.
Yeah. If a client comes in after a split happens, it gets a
NotServingRegionException. This forces it to recalibrate. There is a
little bit of a lull until the daughters of the split show up on new
regionservers. Usually the client figures it out not long after the
daughters have been successfully deployed and away we go again.
If it takes too long, the client will error out with an NSRE.
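In code terms, that recalibrate-and-retry amounts to something like the
hand-rolled loop below (a sketch of the idea only; the real HTable client
does this internally, including re-looking-up the region location, and
MAX_RETRIES/BACKOFF_MS are invented numbers, not actual HBase settings):

// Sketch: what a write that races with a split has to do. The stock client
// library handles this for you; this just makes the behavior explicit.
import org.apache.hadoop.hbase.NotServingRegionException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class RetryingCommit {
  private static final int MAX_RETRIES = 10;     // invented, for illustration
  private static final long BACKOFF_MS = 2000;   // invented, for illustration

  static void commitWithRetry(HTable table, BatchUpdate bu) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        table.commit(bu);      // may land on a region that just closed for a split
        return;
      } catch (NotServingRegionException nsre) {
        if (attempt >= MAX_RETRIES) {
          throw nsre;          // daughters never showed up in time: give up with the NSRE
        }
        Thread.sleep(BACKOFF_MS);   // wait for the daughter regions to be deployed
        // The next attempt looks the region location up again before resending.
      }
    }
  }
}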
...
2008-07-25 13:19:36,985 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction failed for region catalogue,,1216987764459
java.io.FileNotFoundException: File does not exist: hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
    at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
    at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
    at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
    at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
    at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
    at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
This is bad. A data file has gone missing. Are you seeing errors in your
HDFS logs? At the least, enable DEBUG-level logging in HBase; see the FAQ
for how.
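(On a stock install that usually means a line like the following in
conf/log4j.properties, followed by a restart of the daemons, but check the
FAQ in case the recommended way has changed:

log4j.logger.org.apache.hadoop.hbase=DEBUG
)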
My guess is that the region with this missing file is not able to
deploy. We're stuck in a cycle where the master tells a regionserver to
open the region, the open fails because of the FNFE above, and then the
master tries to give it to another regionserver, and so on.
Is this 0.2.0? The handler that reports the above loss and then keeps
going was just fixed in trunk (HBASE-766).
Or to get going again, you could just remove:
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253/
You'll have lost data.
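If it's more convenient than the hadoop fs shell, the removal is just a
couple of FileSystem calls; a minimal sketch, with the two directory paths
copied verbatim from above (the deletes are recursive and permanent):

// Sketch: remove the two orphaned store directories through the HDFS client
// API. Equivalent to deleting them from the hadoop fs shell.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveOrphanedStoreFiles {
  public static void main(String[] args) throws Exception {
    Path mapfiles = new Path("hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253");
    Path info = new Path("hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253");
    FileSystem fs = mapfiles.getFileSystem(new Configuration());
    System.out.println("mapfiles dir still present? " + fs.exists(mapfiles));
    fs.delete(mapfiles, true);   // recursive delete
    fs.delete(info, true);       // recursive delete
  }
}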
Yours,
St.Ack