They were about 64MB each. I put them here:
http://img2.imageshack.us/0000000000618601094
http://img2.imageshack.us/0000000000618601136

They are not gzippable, sorry... full of jpeg data I think.
Here is an error snippet from the master: http://pastebin.com/TdbYbDyy

-Jack

On Sun, Sep 26, 2010 at 4:04 PM, Stack <[email protected]> wrote:
> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <[email protected]> wrote:
>> I had the same issue this morning: some of the regions'
>> 'recovered.edits' were corrupt, and no single region server was able to
>> load them.  I saved them if anyone is interested in seeing why they
>> cannot be processed.
>
>
> Are they zero-length?
>
> If not, please put them somewhere I can pull them to take a look.
>
> You pasted the snippet from the regionserver where we were unable to load
> the recovered edits.  Were there any exceptions in the master log
> when it was processing the crash of the regionserver?
>
>> I think here is what happens:
>>
>> 1. I am writing data to hbase, and it hits the regionserver
>> 2. It runs out of memory and kills itself.
>
>
> The server dumps heap when it crashes (by default).  If you put this
> somewhere that I can pull, I'll take a look at it.
>
>
>> 3. The WAL HLog produces a recovered.edits file that's corrupt.
>> 4. On region reassignment, no single region server can load the
>> affected region, and each tries over and over again, causing undue load
>> on HDFS (network traffic spikes between datanodes).
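
Steps 2 and 3 above are consistent with a write that was cut off mid-record when the process died.  The following is not HBase's actual HLog format, just a toy length-prefixed log in Python that shows how a truncated file produces an end-of-file error partway through replay:

```python
import struct

def write_records(path, records):
    # Each record is a 4-byte big-endian length followed by the payload.
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack(">I", len(rec)) + rec)

def read_records(path):
    recs = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:              # clean end of file
                return recs
            if len(header) < 4:         # file ends mid-header
                raise EOFError("truncated header")
            (length,) = struct.unpack(">I", header)
            payload = f.read(length)
            if len(payload) < length:   # file ends mid-record
                raise EOFError("truncated record")
            recs.append(payload)

write_records("edits.log", [b"put-1", b"put-2", b"put-3"])

# Simulate a crash mid-write: chop a few bytes off the end of the file.
with open("edits.log", "r+b") as f:
    f.truncate(f.seek(0, 2) - 3)

try:
    read_records("edits.log")
except EOFError as e:
    print("replay failed:", e)   # the reader hits EOF inside the last record
```

A reader that only knows "read a length, then read that many bytes" has no way to finish the final record, which is the same shape of failure as the EOFException in the pastes below.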
>>
>> Is there maybe a patch that can detect this condition and delete the
>> recovered.edits file?  We can't load it anyway, and this stops hbase
>> from being able to auto-recover after one or more region servers die.
>
> I could make you something that keeps going through an EOFException, but
> it looks like there is a flag you could set.  See
> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>
> If hbase.skip.errors = true, it should keep going rather than fail.
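
For reference, that flag would go in hbase-site.xml alongside the other properties; a sketch, assuming the flag name hbase.skip.errors from the HRegion source linked above:

```xml
<property>
  <name>hbase.skip.errors</name>
  <value>true</value>
  <description>If true, skip (and log) bad entries when replaying
  recovered.edits rather than failing the region open.
  </description>
</property>
```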
>
> St.Ack
>
>> Now of course, one could say, just don't
>> let region servers die, but even being careful is not enough here; we
>> should expect that RS won't always die cleanly.
>>
>> Thoughts?
>>
>> Thanks,
>> -Jack
>>
>> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
>> unable to load a region.
>>
>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <[email protected]> wrote:
>>> http://pastebin.com/bD3JJ0sD
>>>
>>> The logs were 17MB in size max, and variable sizes like that.
>>>
>>> -Jack
>>>
>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <[email protected]> wrote:
>>>> Please paste to pastebin the section from the regionserver where you
>>>> were getting the EOF.  I'd like to see exactly where (but yeah, you've
>>>> got the right idea moving the files aside).  Check the files too.  Are
>>>> they zero-length?  If so, please look for them in the master log and
>>>> paste me the section where we are splitting.
>>>>
>>>> Thanks Jack,
>>>> St.Ack
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <[email protected]> wrote:
>>>>> It was an EOF exception, but now that I have deleted the edits files:
>>>>>
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>
>>>>> Like these.  All of the regions have loaded... What could that have
>>>>> been?  I assume I lost some writes, but that is not a big deal to
>>>>> me... the question is how to avoid something like this.  Is that a bug?
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <[email protected]> wrote:
>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <[email protected]> wrote:
>>>>>>> So, the datanode log shows no errors whatsoever; however, I do see
>>>>>>> the same blocks fetched repeatedly, and the network speed is quite
>>>>>>> high, yet I am unable to load _some_ regions.  What could it be?
>>>>>>>
>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>> cliID: 
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>> cliID: 
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>> cliID: 
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
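
To confirm the "same blocks fetched repeatedly" observation, one could tally block ids across the clienttrace lines; a quick sketch in Python (the regex assumes the log format shown above, and the sample lines are abbreviated):

```python
import re
from collections import Counter

def count_block_reads(log_text):
    # Tally occurrences per block id in datanode clienttrace output.
    return Counter(re.findall(r"blockid: (blk_-?\d+_\d+)", log_text))

log = """
blockid: blk_364673737339632029_1347910, duration: 1140653
blockid: blk_364673737339632029_1347910, duration: 691929
blockid: blk_-176750251227749356_1535293, duration: 77045
"""

for block, n in count_block_reads(log).most_common():
    if n > 1:
        print(block, "read", n, "times")
```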
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <[email protected]> wrote:
>>>>>>>> (Good one Ryan)
>>>>>>>>
>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>> config change.
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>
>>>>>>>>> -jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the 
>>>>>>>>>> other?
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Also, even with a value of '1', I see:
>>>>>>>>>>>
>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>
>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <[email protected]> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>
>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>        at 
>>>>>>>>>>>> java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>        at 
>>>>>>>>>>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>        at 
>>>>>>>>>>>> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>        at 
>>>>>>>>>>>> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>        at 
>>>>>>>>>>>> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I changed the value to '1' and restarted the regionserver... Note
>>>>>>>>>>>> that my HDFS is not having a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>>> Try
>>>>>>>>>>>>>
>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in 
>>>>>>>>>>>>> a single go
>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <[email protected]> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> My regions are 1GB in size, and when I cold-start the cluster I
>>>>>>>>>>>>>> oversaturate my network links (1000 Mbps) and get client DFS
>>>>>>>>>>>>>> timeouts.  Any way to slow them down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
