They were about 64MB, I will put them somewhere... like here:

http://img2.imageshack.us/0000000000618601094
http://img2.imageshack.us/0000000000618601136
They are not gzippable, sorry... full of jpeg data, I think.

Here is an error snippet from the master: http://pastebin.com/TdbYbDyy

-Jack

On Sun, Sep 26, 2010 at 4:04 PM, Stack <[email protected]> wrote:
> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <[email protected]> wrote:
>> I had the same issue this morning; some of the regions'
>> 'recovered.edits' were corrupt and no single region server was able to
>> load them. I saved them if someone is interested to see why they can
>> not be processed.
>
> Are they zero-length?
>
> If not, please put them somewhere I can pull them to take a look.
>
> You pasted the snippet from the regionserver where we were unable to
> load the recovered edits. Were there any exceptions in the master log
> when it processed the crash of the regionserver?
>
>> I think here is what happens:
>>
>> 1. I am writing data to hbase, and it hits the regionserver.
>> 2. It runs out of memory and kills itself.
>
> The server dumps heap when it crashes (by default). If you put this
> somewhere that I can pull, I'll take a look at it.
>
>> 3. The WAL HLog produces a recovered.edits file that's corrupt.
>> 4. On region reassignment, no single region server can load the
>> affected region, and each tries over and over again, causing undue
>> load on HDFS (network traffic spikes between datanodes).
>>
>> Is there maybe a patch that can detect this condition and delete the
>> recovered.edits file? We can't load it anyway, and this stops hbase
>> from being able to auto-recover after one or more region servers die.
>
> I could make you something that kept going through an EOFException, but
> it looks like there is a flag you could set. See
> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>
> If hbase.skip.errors = true, it should keep going rather than fail.
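The flag Stack points at would normally go in hbase-site.xml. A sketch, assuming only the property name quoted above (hbase.skip.errors) and the same property layout used elsewhere in this thread; the placement is an assumption:

```xml
<!-- hbase-site.xml: ask the regionserver to skip bad entries when
     replaying recovered.edits instead of failing the region open.
     Property name is from the HRegion source linked above (0.89-era);
     where it is read from config is an assumption. -->
<property>
  <name>hbase.skip.errors</name>
  <value>true</value>
</property>
```

Note the trade-off: edits after the corrupt point are silently dropped, which matches Jack's "I assume I lost some writes" outcome from deleting the files by hand.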
>
> St.Ack
>
>> Now of course, one could say, just don't let region servers die, but
>> even being careful is not enough here; we should not expect RS to
>> always die cleanly.
>>
>> Thoughts?
>>
>> Thanks,
>> -Jack
>>
>> PS. Here is http://pastebin.com/81iV243r: a regionserver repeatedly
>> unable to load a region.
>>
>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <[email protected]> wrote:
>>> http://pastebin.com/bD3JJ0sD
>>>
>>> The logs were 17MB in size max, and variable sizes like that.
>>>
>>> -Jack
>>>
>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <[email protected]> wrote:
>>>> Please paste the section from the regionserver where you were getting
>>>> the EOF to pastebin. I'd like to see exactly where (but yeah, you get
>>>> the idea moving the files aside). Check the files too. Are they
>>>> zero-length? If so, please look for them in the master log and paste
>>>> me the section where we are splitting.
>>>>
>>>> Thanks Jack,
>>>> St.Ack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <[email protected]> wrote:
>>>>> It was an EOF exception, but I have now deleted the edits files:
>>>>>
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>
>>>>> Like these. All of the regions have loaded... What could that have
>>>>> been? I assume I lost some writes, but this is not a big deal to
>>>>> me... The question is how to avoid something like that. Is that a bug?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <[email protected]> wrote:
>>>>>> What is the complaint in the regionserver log when the region load fails?
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <[email protected]> wrote:
>>>>>>> So, the datanode log shows no errors whatsoever; however, I do see
>>>>>>> the same blocks fetched repeatedly, and the network speed is quite
>>>>>>> high, but I am unable to load _some_ regions. What could it be?
>>>>>>>
>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <[email protected]> wrote:
>>>>>>>> (Good one, Ryan.)
>>>>>>>>
>>>>>>>> The master is doing the assigning. It needs to be restarted to see
>>>>>>>> the config change.
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>> Only the regionserver. Do I need to restart both?
>>>>>>>>>
>>>>>>>>> -jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the
>>>>>>>>>> other?
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>> Also, even with the '1' value, I see:
>>>>>>>>>>>
>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>
>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>
>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>
>>>>>>>>>>>> I changed the value to '1' and restarted the regionserver... Note
>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>>> Try
>>>>>>>>>>>>>
>>>>>>>>>>>>> <property>
>>>>>>>>>>>>>   <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>   <value>10</value>
>>>>>>>>>>>>>   <description>Maximum number of regions that can be assigned
>>>>>>>>>>>>>   in a single go to a region server.
>>>>>>>>>>>>>   </description>
>>>>>>>>>>>>> </property>
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you have now? Whatever it is, go down from there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>> My regions are 1GB in size, and when I cold-start the cluster I
>>>>>>>>>>>>>> oversaturate my network links (1000 Mbps) and get client dfs
>>>>>>>>>>>>>> timeouts. Any way to slow them down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
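The java.io.EOFException in the trace above comes from readFully demanding a full record that was only partially flushed before the regionserver died. A plain-Python analogue (not the actual SequenceFile wire format; the record layout here is a simplification for illustration) of why a half-written, length-prefixed record fails this way on replay:

```python
import io
import struct

def write_truncated_record(payload: bytes, written: int) -> bytes:
    """Simulate a writer that records a length header, then crashes after
    flushing only `written` of the payload bytes."""
    buf = io.BytesIO()
    buf.write(struct.pack(">i", len(payload)))  # big-endian int length header
    buf.write(payload[:written])                # crash mid-write: payload cut short
    return buf.getvalue()

def read_record(data: bytes) -> bytes:
    """Read one record; mirrors the readFully contract that raised above:
    fewer bytes than the header promised is a hard EOF error."""
    buf = io.BytesIO(data)
    (length,) = struct.unpack(">i", buf.read(4))  # header promises `length` bytes
    body = buf.read(length)
    if len(body) < length:                        # the readFully condition
        raise EOFError(f"wanted {length} bytes, got {len(body)}")
    return body
```

Moving the recovered.edits file aside, as Jack did, or letting replay skip on error both amount to accepting the loss of the truncated tail so the region can open.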
