Oh, yeah, could be the OOME... But is the OOME in the regionserver or in
the Master?  If in the regionserver, I'd think that when splitting we'd skip
the incomplete record.  Something is going on here.  I'd like to figure
it out.  The flag should get you going in that you'll recover everything up to
the last edit.
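
For reference, a minimal sketch of how you might set it, assuming the
usual hbase-site.xml route and a restart of the regionservers (and maybe
the master) to pick it up; the property name is the one in the HRegion
source linked further down the thread, and I'm taking the default to be
false:

  <property>
    <name>hbase.skip.errors</name>
    <value>true</value>
  </property>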

St.Ack


On Mon, Sep 27, 2010 at 11:48 AM, Jack Levin <[email protected]> wrote:
> Yay, my first bug.  I have not tried the flag yet (thanks), I will do
> that shortly.   I was not able to find where those files were created.  It's
> likely the last addition was not flushed properly as the server ran out
> of memory, so we may as well call it lost data (in our case this is fine;
> we have the source data permanently stored).
>
> -Jack
>
> On Mon, Sep 27, 2010 at 11:02 AM, Stack <[email protected]> wrote:
>> I took a look.  When we go to read the last edit in each file, we
>> overshoot.  It's as though the last addition was not properly flushed.
>> I looked at the writing of the recovered.edits file contents and we're
>> using a straight SequenceFile#close, so a flush should be happening
>> (see the read-side sketch below).  You pasted again from the
>> regionserver log.  Could you look for these files in the master log,
>> where they were created, and see if there were any exceptions around
>> their creation?  (I've been playing over here w/ crashing servers and I
>> don't see the issues you are running into on replay of recovered
>> edits -- I'll play some more....)
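>>
>> For context, reading those files back is plain SequenceFile iteration,
>> roughly like the sketch below (a from-memory sketch with a throwaway
>> DumpRecoveredEdits class, not the exact replay code).  A half-written
>> trailing record is what would make reader.next() throw the EOFException
>> you pasted:
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.io.SequenceFile;
>>   import org.apache.hadoop.io.Writable;
>>   import org.apache.hadoop.util.ReflectionUtils;
>>
>>   public class DumpRecoveredEdits {
>>     public static void main(String[] args) throws Exception {
>>       Configuration conf = new Configuration();
>>       Path edits = new Path(args[0]);  // path to one recovered.edits file
>>       SequenceFile.Reader reader =
>>           new SequenceFile.Reader(edits.getFileSystem(conf), edits, conf);
>>       // Instantiate whatever key/value classes the file says it holds.
>>       Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>>       Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
>>       long count = 0;
>>       // A truncated final record surfaces here as an EOFException.
>>       while (reader.next(key, val)) {
>>         count++;
>>       }
>>       reader.close();
>>       System.out.println(count + " edits read from " + edits);
>>     }
>>   }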
>>
>> Did you try setting the flag suggested in previous email?
>>
>> Thanks Jack,
>> St.Ack
>>
>>
>> On Sun, Sep 26, 2010 at 5:36 PM, Jack Levin <[email protected]> wrote:
>>> They were about 64MB, I will put them somewhere... like here:
>>> http://img2.imageshack.us/0000000000618601094
>>> http://img2.imageshack.us/0000000000618601136
>>>
>>> They are not gzippable, sorry... full of jpeg data, I think.
>>> Here is an error snippet from the master: http://pastebin.com/TdbYbDyy
>>>
>>> -Jack
>>>
>>> On Sun, Sep 26, 2010 at 4:04 PM, Stack <[email protected]> wrote:
>>>> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <[email protected]> wrote:
>>>>> I had the same issue this morning: some of the regions'
>>>>> 'recovered.edits' files were corrupt and no region server was able to
>>>>> load them.  I saved them in case someone is interested in seeing why
>>>>> they cannot be processed.
>>>>
>>>>
>>>> Are they zero-length?
>>>>
>>>> If not, please put them somewhere I can pull them to take a look.
>>>>
>>>> You pasted the snippet from the regionserver where we were unable to load
>>>> the recovered edits.   Were there any exceptions in the master log
>>>> when it was processing the crash of the regionserver?
>>>>
>>>>> I think here is what happens:
>>>>>
>>>>> 1. I am writing data to hbase, and it hits the regionserver
>>>>> 2. It runs out of memory and kills itself.
>>>>
>>>>
>>>> The server dumps its heap when it crashes (by default).  If you put this
>>>> somewhere that I can pull it from, I'll take a look at it.
>>>>
>>>>
>>>>> 3. WAL/HLog splitting produces a recovered.edits file that's corrupt.
>>>>> 4. On region reassignment, no region server can load the
>>>>> affected region; each tries over and over again, causing undue load on
>>>>> HDFS (network traffic spikes between datanodes).
>>>>>
>>>>> Is there maybe a patch that can detect this condition and delete the
>>>>> recovered.edits file?  We can't load it anyway, and as it stands this
>>>>> stops hbase from being able to auto-recover after one or more region
>>>>> servers die.
>>>>
>>>> I could make you something that kept going through an EOFException, but
>>>> it looks like there is a flag you could set.  See
>>>> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>>>>
>>>> If hbase.skip.errors = true, it should keep going rather than fail.
>>>>
>>>> St.Ack
>>>>
>>>>> Now of course, one could say: just don't
>>>>> let region servers die.  But even being careful is not enough here; we
>>>>> shouldn't expect an RS to always die cleanly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Thanks,
>>>>> -Jack
>>>>>
>>>>> PS: here is a regionserver repeatedly unable to load a region:
>>>>> http://pastebin.com/81iV243r
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <[email protected]> wrote:
>>>>>> http://pastebin.com/bD3JJ0sD
>>>>>>
>>>>>> The logs were 17MB in size max, and variable sizes like that.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <[email protected]> wrote:
>>>>>>> Please paste to pastebin the section from the regionserver log where you
>>>>>>> were getting the EOF.  I'd like to see exactly where (but yeah, you had
>>>>>>> the right idea moving the files aside).  Check the files too.  Are they
>>>>>>> zero-length?  If so, please look for them in the master log and paste
>>>>>>> me the section where we are splitting.
>>>>>>>
>>>>>>> Thanks Jack,
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <[email protected]> wrote:
>>>>>>>> It was an EOF exception, but now that I have deleted the edits files:
>>>>>>>>
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>>>>
>>>>>>>> Like these.  All of the regions have loaded... What could that have
>>>>>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>>>>>> me... the question is how to avoid something like that.  Is that a bug?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <[email protected]> wrote:
>>>>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>> So, the datanode log shows no errors whatsoever; however, I do see the
>>>>>>>>>> same blocks fetched repeatedly, and the network speed is quite high,
>>>>>>>>>> but I am unable to load _some_ regions.  What could it be?
>>>>>>>>>>
>>>>>>>>>> 2010-09-24 16:38:42,729 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>>>>> 2010-09-24 16:38:44,317 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>>>>> 2010-09-24 16:38:44,318 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>>>>> 2010-09-24 16:38:44,510 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>>>>> 2010-09-24 16:38:44,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>>>>> 2010-09-24 16:38:44,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237, offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569, blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <[email protected]> wrote:
>>>>>>>>>>> (Good one Ryan)
>>>>>>>>>>>
>>>>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>>>>> config change.
>>>>>>>>>>>
>>>>>>>>>>> St.Ack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>>>>
>>>>>>>>>>>> -jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>>>> Did you restart the master and the regionserver? Or just one or 
>>>>>>>>>>>>> the other?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>>>>>>>>>>>> MSG_REGION_OPEN:
>>>>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... 
>>>>>>>>>>>>>>> Note
>>>>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>>>>>> Try
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>>>>> My regions are 1GB in size, and when I cold start the cluster
>>>>>>>>>>>>>>>>> I oversaturate my network links (1000 Mbps) and get client
>>>>>>>>>>>>>>>>> DFS timeouts; any way to slow them down?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
