2008/7/25 stack <[EMAIL PROTECTED]>:
> Tim Sell wrote:
> ...
>>
>> after this crash, I couldn't read the table 'catalogue'. I have other
>> tables that weren't being touched and they were still intact.
>>
>
> What's the exception you get, Tim, when you try to read from this table?
I just tried to do a scan of 'catalogue' from the shell, and this was in the log:
2008-07-25 14:19:41,599 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening scanner (fsOk: true)
java.io.IOException: Region catalogue,,1216987764459 closed
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1311)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1210)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
2008-07-25 14:19:41,600 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020, call openScanner([EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], 9223372036854775807, null) from 10.101.1.31:59232: error: java.io.IOException: Region catalogue,,1216987764459 closed
java.io.IOException: Region catalogue,,1216987764459 closed
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1311)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1210)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
After a restart of hbase I didn't get an exception when doing a scan,
but I'm guessing it lost the regions it couldn't open.
>>
>> For a sense of what sort of load it was under: I was sending data to
>> it via the thrift server in row batches, as fast as hbase could send
>> "I've finished that row" back. This was from a single thread. Hbase is
>> running on a cluster of 3 machines; my process was running on a remote
>> machine and sending rows via thrift.
>> I'm not sure of the exact number of rows it processed before the
>> crash, but it was at least 6 million. So I guess that's heavy-ish load?
>>
>
> Not a particularly heavy load. How big was each row? Multiple columns?
> What size values?
Tiny data per row. Each row mutation was only two columns; each value
was just a short string of, say, 10 characters, as bytes. The row keys
were also short. It had about another 200 million rows to go, though.
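In case it helps, this is roughly the shape of my write path, sketched in
Java against the thrift-generated client (I'm writing the names from
memory, so treat the exact signatures and constructor argument order as my
assumption rather than gospel):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.thrift.generated.Hbase;
  import org.apache.hadoop.hbase.thrift.generated.Mutation;

  public class CatalogueLoader {
    // One row mutation: two tiny columns sent through the thrift server.
    // The call blocks until hbase acks the row ("I've finished that row").
    static void writeRow(Hbase.Client client, String rowKey,
        String albumId, String trackId) throws Exception {
      List<Mutation> mutations = new ArrayList<Mutation>();
      // Mutation fields assumed to be (isDelete, column, value).
      mutations.add(new Mutation(false,
          "album_track_join:album".getBytes(), albumId.getBytes()));
      mutations.add(new Mutation(false,
          "album_track_join:track".getBytes(), trackId.getBytes()));
      client.mutateRow("catalogue".getBytes(), rowKey.getBytes(), mutations);
    }
  }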
>>
>> I've attached my region server log if you want to check it out.
>> I've been going through the logs, and I've noticed a few things.
>>
>> First thing, the region my app complained about finished compaction
>> and then started to split:
>>
>> 2008-07-25 13:09:24,447 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region catalogue,,1216982750797 in 22sec
>> 2008-07-25 13:09:24,448 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region catalogue,,1216982750797
>> 2008-07-25 13:09:25,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: closed catalogue,,1216982750797
>>
>> then I get my first batchUpdate error:
>>
>> 2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020, call batchUpdate([EMAIL PROTECTED], row => 11898254, {column => album_track_join:album, value => '...', column => album_track_join:track, value => '...'}) from 10.101.1.31:41818: error: org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
>> org.apache.hadoop.hbase.NotServingRegionException: Region catalogue,,1216982750797 closed
>>         at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
>>         at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>>
>> Then, just after this, the region split thread finishes, apparently
>> successfully.
>>
>
> Yeah. If a client comes in after a split happens, it gets a
> NotServingRegionException. This forces it to recalibrate. There is a
> little bit of a lull until the daughters-of-the-split show up on new
> regionservers. Usually the client figures it out not long after the
> daughters have been successfully deployed and away we go again.
>
> If it takes too long, the client will error out with an NSRE.
Weird; the exception I get says it retried 5 times. I guess that's the
thrift default, but I'm not sure how long it waits between retries.
Should I catch this exception and force a retry in my program myself then?
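Something like this wrapper is what I had in mind, slotting in next to the
writeRow sketch from earlier in this mail (it assumes the thrift server
surfaces the NSRE as its generic IOError exception, which is a guess on my
part):

  import org.apache.hadoop.hbase.thrift.generated.Hbase;
  import org.apache.hadoop.hbase.thrift.generated.IOError;

  // Hypothetical client-side retry around the per-row write.
  static void writeRowWithRetry(Hbase.Client client, String rowKey,
      String albumId, String trackId) throws Exception {
    int maxRetries = 10;    // more forgiving than the apparent default of 5
    long backoffMs = 1000;  // give the daughter regions time to deploy
    for (int attempt = 0; ; attempt++) {
      try {
        writeRow(client, rowKey, albumId, trackId);
        return;
      } catch (IOError e) {
        if (attempt >= maxRetries) {
          throw e;          // still failing after all retries; give up
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2;     // back off exponentially between attempts
      }
    }
  }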
> ...
>>
>> 2008-07-25 13:19:36,985 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction failed for region catalogue,,1216987764459
>> java.io.FileNotFoundException: File does not exist: hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
>>         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
>>         at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
>>         at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
>>         at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
>>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
>>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
>>
>
> This is bad. A data file has gone missing. You having errors in your HDFS
> logs? At least enable DEBUG level logging in hbase. See the FAQ for how.
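> For example, something like this in conf/log4j.properties (from memory;
> the FAQ has the canonical recipe):
>
>   # bump hbase logging from INFO to DEBUG
>   log4j.logger.org.apache.hadoop.hbase=DEBUG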
>
> My guess is that the region with this missing file is not able to deploy.
> We're stuck in a cycle where the master tells regionserver to open the
> region, it fails because of the FNFE above, and then master tries to give it
> to another regionserver and so on.
>
> Is this 0.2.0? The handler which reports the above loss and then keeps
> going was just fixed, HBASE-766, in trunk.
>
> Or to get going again, you could just remove:
>
> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/
> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253/
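> e.g. with the hadoop shell, something like (-rmr is the old recursive
> delete; double-check the paths before you run it):
>
>   ${HADOOP_HOME}/bin/hadoop fs -rmr <path-to-directory-above>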
>
>
> You'll have lost data.
>
> Yours,
> St.Ack
>
In the hadoop datanode log (for node 1) I'm getting:

2008-07-25 12:53:57,258 ERROR org.apache.hadoop.dfs.DataNode: 10.101.1.31:50010:DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.1.31:50010 remote=/10.101.1.31:44584]

then a bunch of "connection reset by peer" errors, e.g.

2008-07-25 13:09:25,962 ERROR org.apache.hadoop.dfs.DataNode: 10.101.1.31:50010:DataXceiver: java.io.IOException: Connection reset by peer

then more socket timeout exceptions. I haven't checked the other nodes;
I'll have to do that and turn debug mode on in hbase on Monday.
We don't really mind the data loss at this stage; we're practicing, heh.
Another little thing I noticed in the log: we get things like

call batchUpdate([EMAIL PROTECTED]

I suppose the bit at the end is a table name. This is the same as what
getTableNames was returning before the fix, so I'm guessing it's another
leftover from the switch to byte arrays. If I remember I might try and
track it down.
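For reference, that garbled bit looks like Java's default toString() on a
byte[]; a tiny standalone example of what I mean (hypothetical values,
nothing hbase-specific):

  public class ByteArrayToString {
    public static void main(String[] args) {
      byte[] tableName = "catalogue".getBytes();
      // Arrays don't override toString(), so this prints something like
      // "call batchUpdate([B@7d4991ad" -- a type tag plus identity hash.
      System.out.println("call batchUpdate(" + tableName);
      // Converting explicitly gives the readable text:
      System.out.println("call batchUpdate(" + new String(tableName));
    }
  }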
On Monday I'll enable hbase debug logging and hopefully try to replicate the error.
~Tim.