Yeah, I had to retry a couple of times ("Too busy; try back later -- or
sign up for the premium service!"). It would have been nice to have wider
log snippets. I'd like to have seen whether the issue was double
assignment. The master log snippet only shows the split. Regionserver
209's log is the one where the interesting stuff is going on around this
time, 2010-03-15 16:06:51,150, but it's not in the provided set. Nor are
you running at DEBUG level, so it would be hard to see what is up even if
you had provided it. Looking in 208, I see a few exceptions beyond the
one you paste below.

For sure you've upped xceivers on your hdfs cluster and you've upped the
file descriptors as per the 'Getting Started'? (Sorry, have to ask.)

Can I have more of the logs? Can I have all of the namenode log, all of
the master log, and 209's log? This rapidshare thing is fine with me. I
don't mind retrying.

Sorry it took me a while to get to this.
St.Ack

On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <[email protected]> wrote:
> Hello Stack,
> > Sorry. It's taken me a while. Let me try and get to this this evening.
> Is it downloading the log files that takes you a while? I'm sorry, I
> used to upload files to skydrive, but now we can't access that website.
> Is there any netdisk or something you can download fast? I can upload
> to it.
> LvZheng
> 2010/3/18 Stack <[email protected]>
>
>> Sorry. It's taken me a while. Let me try and get to this this evening.
>>
>> Thank you for your patience.
>>
>> On Mar 17, 2010, at 2:29 AM, Zheng Lv <[email protected]> wrote:
>>
>>> Hello Stack,
>>> Did you receive my mail? It looks like you didn't.
>>> LvZheng
>>>
>>> 2010/3/16 Zheng Lv <[email protected]>
>>>
>>>> Hello Stack,
>>>> I have uploaded some parts of the logs on master, regionserver208 and
>>>> regionserver210 to:
>>>> http://rapidshare.com/files/363988384/master_207_log.txt.html
>>>> http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>>>> http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>>>> I noticed that there are some LeaseExpiredExceptions and "2010-03-15
>>>> 16:06:32,864 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>> Compaction/Split failed for region ..." before 17 o'clock. Did these
>>>> lead to the error? Why did these happen? How can we avoid them?
>>>> Thanks.
>>>> LvZheng
>>>> 2010/3/16 Stack <[email protected]>
>>>>
>>>>> Maybe just the master log from around this time would be sufficient
>>>>> to figure the story.
>>>>> St.Ack
>>>>>
>>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <[email protected]> wrote:
>>>>>
>>>>>> Hey Zheng:
>>>>>>
>>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <[email protected]> wrote:
>>>>>>
>>>>>>> Hello Stack,
>>>>>>> After we got these exceptions, we restarted the cluster and
>>>>>>> restarted the job that failed, and the job succeeded.
>>>>>>> Now when we access
>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894, we got
>>>>>>> "Cannot access /hbase/summary/1491233486/metrics/5046821377427277894:
>>>>>>> No such file or directory.".
>>>>>>
>>>>>> So, that would seem to indicate that the reference was in memory
>>>>>> only; that file was not in the filesystem. You could have tried
>>>>>> closing that region. It would have been interesting also to find
>>>>>> history on that region, to try and figure how it came to hold in
>>>>>> memory a reference to a file since removed.
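[Archive note: "closing that region" above refers to doing it by hand
from the HBase shell. In the 0.20-era shell this was the close_region
command; the region name below is a placeholder for the full
'summary,...' region name from the error, and you should confirm the
exact signature with help 'close_region' in your own shell version.]

```
hbase(main):001:0> close_region 'REGION_NAME'
```

Closing a region this way forces the master to reassign and reopen it,
which drops any stale in-memory store-file references the old deployment
was holding.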
>>>>>>> The messages about this file in the namenode logs are here:
>>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>>>>>>
>>>>>> This is interesting. Do you have regionserver logs from 209, 208,
>>>>>> and 210 for corresponding times?
>>>>>>
>>>>>> Thanks,
>>>>>> St.Ack
>>>>>>
>>>>>>> The job that failed started at about 17 o'clock.
>>>>>>> By the way, the hadoop version we are using is 0.20.1; the hbase
>>>>>>> version we are using is 0.20.3.
>>>>>>>
>>>>>>> Regards,
>>>>>>> LvZheng
>>>>>>> 2010/3/16 Stack <[email protected]>
>>>>>>>
>>>>>>>> Can you get that file from hdfs?
>>>>>>>>
>>>>>>>>   ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>
>>>>>>>> Does it look wholesome? Is it empty?
>>>>>>>>
>>>>>>>> What if you trace the life of that file in the regionserver logs
>>>>>>>> or, probably better, over in the namenode log? If you move this
>>>>>>>> file aside, does the region deploy?
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello Everyone,
>>>>>>>>> Recently we often got these in our client logs:
>>>>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
>>>>>>>>> to contact region server 172.16.1.208:60020 for region
>>>>>>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>>>>>>>> row 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>>>>>>>> but failed after 10 attempts.
>>>>>>>>> Exceptions:
>>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>>>>>>>>     at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>>>>     at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>>>>>>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>>>>>>>>     at org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>>>>>>>>     at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>> Is there any way to fix this problem? Or is there anything we
>>>>>>>>> can do, even manually, to relieve it?
>>>>>>>>> Any suggestion?
>>>>>>>>> Thank you.
>>>>>>>>> LvZheng
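[Archive note: Stack's "trace the life of that file" suggestion above
amounts to grepping every mention of the store file out of the namenode
and regionserver logs, so you can see its creation, block allocation,
and the delete that removed it. A minimal sketch follows; the log paths
are hypothetical and must be adjusted to your deployment.]

```shell
# Trace one HBase store file through the HDFS namenode and
# regionserver logs. Log locations below are illustrative.
STOREFILE=/hbase/summary/1491233486/metrics/5046821377427277894
for log in /var/log/hadoop/*namenode*.log /var/log/hbase/*regionserver*.log; do
  # Skip unexpanded globs / missing files so the loop is safe to run anywhere.
  [ -f "$log" ] || continue
  echo "== $log =="
  grep -n "$STOREFILE" "$log"
done
```

Reading the matches in timestamp order shows whether the file was
deleted (e.g. by a compaction) while a region still held a reference
to it.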

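[Archive note: the two settings Stack asks about early in this thread
("upped xceivers ... upped the file descriptors") come from the HBase
'Getting Started' prerequisites of that era. Sketches below: the
property name is the real (historically misspelled) one from
0.20-era HDFS, but the values and the 'hadoop' user name are
illustrative, not prescriptive.]

```
<!-- hdfs-site.xml on each datanode: raise the transceiver ceiling.
     Note the historical misspelling "xcievers" in the property name. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
```

```
# /etc/security/limits.conf: raise the open-file limit for the user
# running the datanode/regionserver processes (user name illustrative).
hadoop  -  nofile  32768
```

Both limits need a process restart to take effect, and a too-low value
for either shows up as exactly the kind of "Cannot open filename" /
missing-block errors discussed in this thread.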