Tim Sell wrote:
...
after this crash, I couldn't read the table 'catalogue'. I have other
tables that weren't being touched and they were still intact.
What's the exception you get, Tim, when you try to read from this table?
To give a sense of what sort of load it was under: I was sending data to
it via the Thrift server in row batches, as fast as HBase could send
"I've finished that row" back. This was from a single thread. HBase is
running on a cluster of 3 machines; my process was running on a remote
machine and sending rows via Thrift.
I'm not sure of the exact number of rows it processed before the crash,
but it was at least 6 million. So I guess that's heavy-ish load?
Not a particularly heavy load. How big was each row? Multiple
columns? What size values?
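For scale, the write pattern you describe looks roughly like the loop below
with the native Java client (a minimal sketch only, not your actual Thrift
code, and assuming the 0.2-era BatchUpdate/commit API; the table and column
names are taken from your log excerpt below, the row keys and values are
invented):

// Sketch of a single-threaded batch-write loop: one row per round trip,
// the next row is not sent until the server acks the previous commit.
// Not Tim's Thrift code; table/column names come from the log below.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class CatalogueLoader {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "catalogue");
    for (long i = 0; i < 6000000L; i++) {   // ~6 million rows, as reported
      BatchUpdate bu = new BatchUpdate("row-" + i);   // invented row key
      bu.put("album_track_join:album", ("album-" + i).getBytes());
      bu.put("album_track_join:track", ("track-" + i).getBytes());
      table.commit(bu);   // blocks until the regionserver acks the row
    }
  }
}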
I've attached my region server log if you want to check it out.
I've been going through the logs, and I've noticed a few things.
First thing, the region my app complained about finished compaction,
then tried to split my table:
2008-07-25 13:09:24,447 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region catalogue,,1216982750797 in 22sec
2008-07-25 13:09:24,448 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region catalogue,,1216982750797
2008-07-25 13:09:25,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: closed catalogue,,1216982750797
then I get my first batchUpdate error:
2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020, call batchUpdate([EMAIL PROTECTED], row => 11898254, {column => album_track_join:album, value => '...', column => album_track_join:track, value => '...'}) from 10.101.1.31:41818: error: org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
    at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
    at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
Then, just after this, the region split thread finishes, apparently
successfully.
Yeah. If a client comes in after a split happens, it gets a
NotServingRegionException. This forces it to recalibrate. There is a
little bit of a lull until the daughters of the split show up on new
regionservers. Usually the client figures it out not long after the
daughters have been successfully deployed and away we go again.
If it takes too long, the client will error out with an NSRE.
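In code terms, that recalibrate-and-retry amounts to something like the
hand-rolled loop below (a sketch of the idea only; the real HTable client
does this internally, including re-looking-up the region location, and
MAX_RETRIES/BACKOFF_MS are invented numbers, not actual HBase settings):

// Sketch: what a write that races with a split has to do. The stock client
// library handles this for you; this just makes the behavior explicit.
import org.apache.hadoop.hbase.NotServingRegionException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class RetryingCommit {
  private static final int MAX_RETRIES = 10;     // invented, for illustration
  private static final long BACKOFF_MS = 2000;   // invented, for illustration

  static void commitWithRetry(HTable table, BatchUpdate bu) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        table.commit(bu);      // may land on a region that just closed for a split
        return;
      } catch (NotServingRegionException nsre) {
        if (attempt >= MAX_RETRIES) {
          throw nsre;          // daughters never showed up in time: give up with the NSRE
        }
        Thread.sleep(BACKOFF_MS);   // wait for the daughter regions to be deployed
        // The next attempt looks the region location up again before resending.
      }
    }
  }
}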
...
2008-07-25 13:19:36,985 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction failed for region catalogue,,1216987764459
java.io.FileNotFoundException: File does not exist: hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
    at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
    at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
    at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
    at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
    at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
    at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
This is bad. A data file has gone missing. Are you seeing errors in your
HDFS logs? At the least, enable DEBUG-level logging in HBase; see the FAQ
for how.
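(On a stock install that usually means a line like the following in
conf/log4j.properties, followed by a restart of the daemons, but check the
FAQ in case the recommended way has changed:

log4j.logger.org.apache.hadoop.hbase=DEBUG
)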
My guess is that the region with this missing file is not able to
deploy. We're stuck in a cycle where the master tells a regionserver to
open the region, the open fails because of the FNFE above, and then the
master tries to give it to another regionserver, and so on.
Is this 0.2.0? The handler that reports the above loss and then keeps
going was just fixed in trunk (HBASE-766).
Or to get going again, you could just remove:
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253/
You'll have lost data.
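If it's more convenient than the hadoop fs shell, the removal is just a
couple of FileSystem calls; a minimal sketch, with the two directory paths
copied verbatim from above (the deletes are recursive and permanent):

// Sketch: remove the two orphaned store directories through the HDFS client
// API. Equivalent to deleting them from the hadoop fs shell.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveOrphanedStoreFiles {
  public static void main(String[] args) throws Exception {
    Path mapfiles = new Path("hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253");
    Path info = new Path("hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253");
    FileSystem fs = mapfiles.getFileSystem(new Configuration());
    System.out.println("mapfiles dir still present? " + fs.exists(mapfiles));
    fs.delete(mapfiles, true);   // recursive delete
    fs.delete(info, true);       // recursive delete
  }
}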
Yours,
St.Ack