Thanks for jumping in, Dhruba.

Should the 'failing to allocate a datanode for a block' show in the logs as an
exception? (I don't see one.) Would something show in the logs if we were at
DEBUG level?

Thanks,
St.Ack
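(On the DEBUG question: a minimal sketch, not verified against this exact
0.15.0-dev build -- the namenode's block-allocation and replica-placement
classes can usually be made chattier by raising their log4j levels. The logger
names below are taken from the namenode log quoted further down; add them to
conf/log4j.properties on the namenode host and restart it.)

    # conf/log4j.properties on the namenode (assumes the stock layout)
    # Block/state-change events such as the allocateBlock lines quoted below:
    log4j.logger.org.apache.hadoop.dfs.StateChange=DEBUG
    # Replica placement ("Not able to place enough replicas", etc.):
    log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG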
dhruba Borthakur wrote:
> Hi Kareem,
>
> I think you have run into
> http://issues.apache.org/jira/browse/HADOOP-2275
>
> I have a fix for this problem. You can find that fix as an attachment to the
> above JIRA. If you typically build your Hadoop distribution from source, can
> you please apply the above patch to your source and then try out your
> experiment? If not, that's fine; this fix will probably go into the next
> release of hadoop.

(A sketch of applying the JIRA patch and rebuilding from source is appended
after the quoted thread below.)

> Thanks for your help,
> dhruba
>
> -----Original Message-----
> From: Kareem Dana [mailto:[EMAIL PROTECTED]
> Sent: Saturday, November 24, 2007 9:13 PM
> To: [email protected]
> Subject: Re: 答复: HBase PerformanceEvaluation failing
>
> I cleaned the logs, cleaned and reformatted the DFS. Started up DFS,
> mapred, and hbase.
> Ran fsck just to make sure:
>
> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
> .....Status: HEALTHY
>  Total size: 450 B
>  Total blocks: 3 (avg. block size 150 B)
>  Total dirs: 16
>  Total files: 5
>  Over-replicated blocks: 0 (0.0 %)
>  Under-replicated blocks: 0 (0.0 %)
>  Target replication factor: 1
>  Real replication factor: 1.0
>
> The filesystem under path '/' is HEALTHY
>
> Started the PE Test:
>
> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop jar
>   /usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
>   sequentialWrite 2
> 07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name: TestTable,
>   families: {info:={name: info, max versions: 3, compression: NONE,
>   in memory: false, max length: 2147483647, bloom filter: none}} created
> 07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
> 07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
> 07/11/24 23:14:07 INFO mapred.JobClient:  map 0% reduce 0%
> 07/11/24 23:20:53 INFO mapred.JobClient:  map 5% reduce 0%
> 07/11/24 23:21:03 INFO mapred.JobClient:  map 5% reduce 1%
> 07/11/24 23:22:35 INFO mapred.JobClient:  map 10% reduce 1%
> 07/11/24 23:22:46 INFO mapred.JobClient:  map 10% reduce 3%
> 07/11/24 23:24:10 INFO mapred.JobClient:  map 14% reduce 3%
> 07/11/24 23:24:22 INFO mapred.JobClient:  map 19% reduce 3%
> 07/11/24 23:24:34 INFO mapred.JobClient:  map 19% reduce 5%
> 07/11/24 23:29:31 INFO mapred.JobClient:  map 24% reduce 5%
> 07/11/24 23:29:42 INFO mapred.JobClient:  map 24% reduce 6%
> 07/11/24 23:31:37 INFO mapred.JobClient:  map 29% reduce 6%
> 07/11/24 23:31:45 INFO mapred.JobClient:  map 30% reduce 6%
> 07/11/24 23:31:56 INFO mapred.JobClient:  map 30% reduce 10%
> 07/11/24 23:32:12 INFO mapred.JobClient:  map 35% reduce 10%
> 07/11/24 23:32:18 INFO mapred.JobClient:  map 35% reduce 11%
> 07/11/24 23:37:33 INFO mapred.JobClient:  map 40% reduce 11%
> 07/11/24 23:37:39 INFO mapred.JobClient:  map 40% reduce 13%
> 07/11/24 23:37:45 INFO mapred.JobClient:  map 45% reduce 13%
> 07/11/24 23:37:54 INFO mapred.JobClient:  map 45% reduce 15%
> 07/11/24 23:42:02 INFO mapred.JobClient:  map 50% reduce 15%
> 07/11/24 23:52:14 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>
> Checked DFS status again:
>
> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
> .....
> /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data:
>   MISSING 1 blocks of total size 0 B.
> .......
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data:
>   MISSING 1 blocks of total size 0 B.
> ....................................
> /tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data:
>   MISSING 1 blocks of total size 0 B.
> ....................................................
> ....................................................
> /tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data:
>   MISSING 1 blocks of total size 0 B.
> ...........
> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025:
>   MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
>  Total size: 1823865158 B
>  Total blocks: 174 (avg. block size 10481983 B)
>  Total dirs: 180
>  Total files: 164
>  Over-replicated blocks: 0 (0.0 %)
>  Under-replicated blocks: 0 (0.0 %)
>  Target replication factor: 1
>  Real replication factor: 1.0
>
> The filesystem under path '/' is CORRUPT

(A sketch of a more detailed fsck run, listing the affected files and blocks,
is appended after the quoted thread below.)

> All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/
>
> The namenode log looks fine until the very end when it has a few of
> these errors:
>
> 2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
>   NameSystem.allocateBlock:
>   /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data.
>   blk_-6879410450276751752
> 2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not
>   able to place enough replicas, still in need of 1
> 2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server
>   handler 1 on 54310, call
>   addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data,
>   DFSClient_-1049549396) from 172.16.6.56:56970: error:
>   java.io.IOException: File
>   /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>   could only be replicated to 0 nodes, instead of 1
> java.io.IOException: File
>   /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>   could only be replicated to 0 nodes, instead of 1
>         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>         at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>
> The hbase region servers seem to be having a lot of problems, though,
> according to their logs. Not quite sure what could be the cause.
>
> - Kareem
>
> On Nov 24, 2007 9:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> Can you clear the logs, clean out the file system and run the test again?
>>
>> The namenode logs should tell an interesting story.
>>
>> On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>
>>> I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
>>> the PerformanceEvaluation test is corrupting it. Before I ran the
>>> test, I ran fsck and the DFS was reported as HEALTHY. Once the PE
>>> fails, the DFS is reported as corrupt. I tried to simplify my setup
>>> and run the PE again. My new config is as follows:
>>>
>>> hadoop07    - DFS master, mapred master, hbase master
>>> hadoop09-10 - 2 hbase region servers
>>> hadoop11-12 - 2 datanodes, task trackers
>>>
>>> mapred.map.tasks = 2
>>> mapred.reduce.tasks = 1
>>> dfs.replication = 1

(A sketch of how these settings would look in conf/hadoop-site.xml is appended
after the quoted thread below.)

>>> I ran the distributed PE in that configuration and it still failed
>>> with similar errors. The output of the hadoop fsck for this run was:
>>>
>>> ..........
>>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
>>>   MISSING 1 blocks of total size 0 B.
>>> ......................................
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
>>>   MISSING 1 blocks of total size 0 B.
>>> ....
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
>>>   MISSING 1 blocks of total size 0 B.
>>> ....
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
>>>   MISSING 1 blocks of total size 0 B.
>>> .........
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
>>>   MISSING 1 blocks of total size 0 B.
>>> ...................................
>>> ........................................................................
>>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
>>>   MISSING 1 blocks of total size 0 B.
>>> .Status: CORRUPT
>>>  Total size: 1890454330 B
>>>  Total blocks: 180 (avg. block size 10502524 B)
>>>  Total dirs: 190
>>>  Total files: 173
>>>  Over-replicated blocks: 0 (0.0 %)
>>>  Under-replicated blocks: 0 (0.0 %)
>>>  Target replication factor: 1
>>>  Real replication factor: 1.0
>>>
>>> The filesystem under path '/' is CORRUPT
>>>
>>> On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> I think that stack was suggesting an HDFS fsck, not a disk-level fsck.
>>>>
>>>> Try [hadoop fsck /]
>>>>
>>>> On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I do not have root access on the xen cluster I'm using. I will ask the
>>>>> admin to make sure the disk is working properly. Regarding the
>>>>> mismatched versions, though, are you suggesting that different region
>>>>> servers might be running different versions of hbase/hadoop? They are
>>>>> all running the same code from the same shared storage. There isn't
>>>>> even another version of hadoop anywhere for the other nodes to run. I
>>>>> think I'll try dropping my cluster down to 2 nodes and working back
>>>>> up... maybe I can pinpoint a specific problem node. Thanks for taking
>>>>> a look at my logs.
>>>>>
>>>>> On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I took a quick look, Kareem. As with the last time, hbase keeps having
>>>>>> trouble w/ the hdfs. Things start out fine around 16:00, then go bad
>>>>>> because it can't write reliably to the hdfs -- a variety of reasons. You
>>>>>> then seem to restart the cluster around 17:37 or so and things seem to
>>>>>> go along fine for a while until 19:05 when, again, all regionservers
>>>>>> report trouble writing the hdfs. Have you run an fsck?
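(On dhruba's suggestion to apply the HADOOP-2275 fix: a rough sketch for a
source checkout built with ant, as in the PE run above. The patch file name
and path are placeholders for whatever the JIRA attachment is called, and the
exact ant targets can differ between releases, so treat this as a starting
point rather than a verified recipe.)

    # From the top of the Hadoop source tree. Patches of this era were usually
    # "svn diff" output taken from the project root, hence -p0; use -p1 if the
    # paths in the patch carry a leading directory.
    cd ~/hadoop
    patch -p0 < /path/to/HADOOP-2275.patch   # placeholder attachment name/path

    # Rebuild the core jar (the hbase contrib test jar used by the PE run
    # lands under build/contrib/hbase/ when the contrib projects are built).
    ant clean jar

    # Then push the rebuilt jars to the nodes and restart DFS, mapred, and
    # hbase before re-running the PerformanceEvaluation.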
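(On the CORRUPT fsck reports in the thread: fsck can also list the affected
files, their blocks, and the datanodes it expected to hold them, which helps
separate "block was allocated but never written" from "a datanode lost it". A
sketch, assuming the fsck options available in this 0.15-era build; flag
support may vary by release.)

    # Show per-file block lists and known replica locations:
    bin/hadoop fsck / -files -blocks -locations

    # The "MISSING 1 blocks of total size 0 B" entries are consistent with the
    # namenode's "could only be replicated to 0 nodes, instead of 1" errors:
    # a block was allocated for the file but no datanode ever received it.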
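(On Kareem's pared-down configuration: purely for reference, this is roughly
how those three settings would be written in conf/hadoop-site.xml on a cluster
of this vintage -- a sketch of where the quoted property names live, not a
recommendation of the values. Note that with dfs.replication = 1 every block
has a single replica, so any datanode or write failure leaves a missing
block.)

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.map.tasks</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>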
