dhruba Borthakur wrote:
> Hi there,
>
> The "failing to allocate a datanode for a block" already shows up as an
> exception in the namenode logs:
>
>> java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>
> Do you agree that this message is sufficient?

IMO, it's missing *why* the block '...could only be replicated to 0 nodes...'. Is there a shortage of some critical resource that needs addressing? (Memory? Data nodes? Is it a timing issue?)
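(As an aside: if the cause is simply that no datanodes are registered, or that they have run out of DFS space, the namenode's cluster report usually makes that obvious. A quick check -- just a sketch assuming a stock install; the exact report layout varies between Hadoop versions:

    bin/hadoop dfsadmin -report

It lists how many datanodes are currently live and the configured/remaining DFS capacity on each; zero live nodes, or no remaining capacity, would explain a block that can only be replicated to 0 of 1 nodes.)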
Thanks Dhruba,
St.Ack

> Thanks,
> dhruba
>
> -----Original Message-----
> From: stack [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 9:21 AM
> To: [email protected]
> Subject: Re: Reply: HBase PerformanceEvaluation failing
>
> Thanks for jumping in, Dhruba.
>
> Should the 'failing to allocate a datanode for a block' show in the logs
> as an exception? (I don't see one). Would something show in logs if we
> were at DEBUG level?
>
> Thanks,
> St.Ack
>
>
> dhruba Borthakur wrote:
>
>> Hi Kareem,
>>
>> I think you have run into
>> http://issues.apache.org/jira/browse/HADOOP-2275
>>
>> I have a fix for this problem. You can find that fix as an attachment to the
>> above JIRA. If you typically build your Hadoop distribution from source, can
>> you please apply the above patch to your source and then try out your
>> experiment? If not, that's fine; this fix will probably go into the next
>> release of Hadoop.
>>
>> Thanks for your help,
>> dhruba
>>
>> -----Original Message-----
>> From: Kareem Dana [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, November 24, 2007 9:13 PM
>> To: [email protected]
>> Subject: Re: Reply: HBase PerformanceEvaluation failing
>>
>> I cleaned the logs, cleaned and reformatted the DFS. Started up DFS,
>> mapred, and hbase.
>> Ran fsck just to make sure:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
>> .....Status: HEALTHY
>> Total size: 450 B
>> Total blocks: 3 (avg. block size 150 B)
>> Total dirs: 16
>> Total files: 5
>> Over-replicated blocks: 0 (0.0 %)
>> Under-replicated blocks: 0 (0.0 %)
>> Target replication factor: 1
>> Real replication factor: 1.0
>>
>> The filesystem under path '/' is HEALTHY
>>
>> Started the PE Test:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop jar
>> /usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
>> sequentialWrite 2
>> 07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name: TestTable,
>> families: {info:={name: info, max versions: 3, compression: NONE,
>> in memory: false, max length: 2147483647, bloom filter: none}} created
>> 07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
>> 07/11/24 23:14:07 INFO mapred.JobClient:  map 0% reduce 0%
>> 07/11/24 23:20:53 INFO mapred.JobClient:  map 5% reduce 0%
>> 07/11/24 23:21:03 INFO mapred.JobClient:  map 5% reduce 1%
>> 07/11/24 23:22:35 INFO mapred.JobClient:  map 10% reduce 1%
>> 07/11/24 23:22:46 INFO mapred.JobClient:  map 10% reduce 3%
>> 07/11/24 23:24:10 INFO mapred.JobClient:  map 14% reduce 3%
>> 07/11/24 23:24:22 INFO mapred.JobClient:  map 19% reduce 3%
>> 07/11/24 23:24:34 INFO mapred.JobClient:  map 19% reduce 5%
>> 07/11/24 23:29:31 INFO mapred.JobClient:  map 24% reduce 5%
>> 07/11/24 23:29:42 INFO mapred.JobClient:  map 24% reduce 6%
>> 07/11/24 23:31:37 INFO mapred.JobClient:  map 29% reduce 6%
>> 07/11/24 23:31:45 INFO mapred.JobClient:  map 30% reduce 6%
>> 07/11/24 23:31:56 INFO mapred.JobClient:  map 30% reduce 10%
>> 07/11/24 23:32:12 INFO mapred.JobClient:  map 35% reduce 10%
>> 07/11/24 23:32:18 INFO mapred.JobClient:  map 35% reduce 11%
>> 07/11/24 23:37:33 INFO mapred.JobClient:  map 40% reduce 11%
>> 07/11/24 23:37:39 INFO mapred.JobClient:  map 40% reduce 13%
>> 07/11/24 23:37:45 INFO mapred.JobClient:  map 45% reduce 13%
>> 07/11/24 23:37:54 INFO mapred.JobClient:  map 45% reduce 15%
>> 07/11/24 23:42:02 INFO mapred.JobClient:  map 50% reduce 15%
>> 07/11/24 23:52:14 INFO mapred.JobClient:  map 100% reduce 100%
>> java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:585)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>
>> Checked DFS status again:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
>> .....
>> /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data:
>> MISSING 1 blocks of total size 0 B.
>> .......
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data:
>> MISSING 1 blocks of total size 0 B.
>> ....................................
>> /tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data:
>> MISSING 1 blocks of total size 0 B.
>> ....................................................
>> ....................................................
>> /tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data:
>> MISSING 1 blocks of total size 0 B.
>> ...........
>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025:
>> MISSING 1 blocks of total size 0 B.
>> .Status: CORRUPT
>> Total size: 1823865158 B
>> Total blocks: 174 (avg. block size 10481983 B)
>> Total dirs: 180
>> Total files: 164
>> Over-replicated blocks: 0 (0.0 %)
>> Under-replicated blocks: 0 (0.0 %)
>> Target replication factor: 1
>> Real replication factor: 1.0
>>
>> The filesystem under path '/' is CORRUPT
>>
>> All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/
>>
>> The namenode log looks fine until the very end when it has a few of
>> these errors:
>> 2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
>> NameSystem.allocateBlock:
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data.
>> blk_-6879410450276751752
>> 2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not
>> able to place enough replicas, still in need of 1
>> 2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 54310, call
>> addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data,
>> DFSClient_-1049549396) from 172.16.6.56:56970: error: java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>> java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>>     at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>>     at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:585)
>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>>
>> The hbase region servers seem to be having a lot of problems though,
>> according to their logs. Not quite sure what could be the cause.
>>
>> - Kareem
>>
>> On Nov 24, 2007 9:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> Can you clear the logs, clean out the file system and run the test again?
>>>
>>> The namenode logs should tell an interesting story.
>>>
>>> On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>>
>>>> I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
>>>> the PerformanceEvaluation test is corrupting it. Before I ran the
>>>> test, I ran fsck and the DFS was reported as HEALTHY. Once the PE
>>>> fails, the DFS is reported as corrupt. I tried to simplify my setup
>>>> and run the PE again. My new config is as follows:
>>>>
>>>> hadoop07 - DFS master, mapred master, hbase master
>>>> hadoop09-10 - 2 hbase region servers
>>>> hadoop11-12 - 2 datanodes, task trackers
>>>>
>>>> mapred.map.tasks = 2
>>>> mapred.reduce.tasks = 1
>>>> dfs.replication = 1
>>>>
>>>> I ran the distributed PE in that configuration and it still failed
>>>> with similar errors. The output of the hadoop fsck for this run was:
>>>>
>>>> ..........
>>>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ......................................
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ....
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ....
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> .........
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ...................................
>>>> ........................................................................
>>>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
>>>> MISSING 1 blocks of total size 0 B.
>>>> .Status: CORRUPT
>>>> Total size: 1890454330 B
>>>> Total blocks: 180 (avg. block size 10502524 B)
>>>> Total dirs: 190
>>>> Total files: 173
>>>> Over-replicated blocks: 0 (0.0 %)
>>>> Under-replicated blocks: 0 (0.0 %)
>>>> Target replication factor: 1
>>>> Real replication factor: 1.0
>>>>
>>>> The filesystem under path '/' is CORRUPT
>>>>
>>>> On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I think that stack was suggesting an HDFS fsck, not a disk-level fsck.
>>>>>
>>>>> Try [hadoop fsck /]
>>>>>
>>>>> On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I do not have root access on the xen cluster I'm using. I will ask the
>>>>>> admin to make sure the disk is working properly. Regarding the
>>>>>> mismatched versions though, are you suggesting that different region
>>>>>> servers might be running different versions of hbase/hadoop? They are
>>>>>> all running the same code from the same shared storage. There isn't
>>>>>> even another version of hadoop anywhere for the other nodes to run. I
>>>>>> think I'll try dropping my cluster down to 2 nodes and working back
>>>>>> up... maybe I can pinpoint a specific problem node. Thanks for taking
>>>>>> a look at my logs.
>>>>>>
>>>>>> On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I took a quick look, Kareem. As with the last time, hbase keeps having
>>>>>>> trouble w/ the hdfs. Things start out fine around 16:00 then go bad
>>>>>>> because it can't write reliably to the hdfs -- a variety of reasons. You
>>>>>>> then seem to restart the cluster around 17:37 or so and things seem to
>>>>>>> go along fine for a while until 19:05 when again, all regionservers
>>>>>>> report trouble writing the hdfs. Have you run an fsck?
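[Note: when fsck reports MISSING blocks like the runs above, asking it for per-file detail can show which blocks are affected and whether any datanode is recorded as holding them. Something along these lines -- the -files/-blocks/-locations switches are an assumption about what this Hadoop version supports, so check the usage message for `bin/hadoop fsck` first:

    bin/hadoop fsck /tmp/hadoop-kcd/hbase -files -blocks -locations

That narrows the report to the hbase directory and prints each file's block list together with the datanodes, if any, that are supposed to hold each block.]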
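[Note: for Dhruba's suggestion above of applying the HADOOP-2275 fix to a source build, the recipe is roughly the following sketch. The patch file name here is hypothetical (use whatever attachment is on the JIRA), and it assumes the ant-based build of the 0.15-era tree:

    cd ~/hadoop-src                    # your Hadoop source checkout
    patch -p0 < HADOOP-2275.patch      # apply the fix attached to the JIRA
    ant jar                            # rebuild the hadoop jar

Then push the rebuilt jar to every node and restart DFS before re-running the PerformanceEvaluation.]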
