dhruba Borthakur wrote:
> Hi there,
>
> The "failing to allocate a datanode for a block" already shows up as an
> exception in the namenode logs:
>
>> java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>
> Do you agree that this message is sufficient?

IMO, it's missing *why* the block '...could only be replicated to 0 nodes...'. Is there a shortage of some critical resource that needs addressing? (Memory? Data nodes? Is it a timing issue?)
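(As an aside: if the cause is simply that no datanodes are registered, or that they have run out of DFS space, the namenode's cluster report usually makes that obvious. A quick check -- just a sketch assuming a stock install; the exact report layout varies between Hadoop versions:

    bin/hadoop dfsadmin -report

It lists how many datanodes are currently live and the configured/remaining DFS capacity on each; zero live nodes, or no remaining capacity, would explain a block that can only be replicated to 0 of 1 nodes.)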
Thanks Dhruba,
St.Ack

> Thanks,
> dhruba
>
> -----Original Message-----
> From: stack [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 9:21 AM
> To: [email protected]
> Subject: Re: Reply: HBase PerformanceEvaluation failing
>
> Thanks for jumping in, Dhruba.
>
> Should the 'failing to allocate a datanode for a block' show in the logs
> as an exception? (I don't see one). Would something show in logs if we
> were at DEBUG level?
>
> Thanks,
> St.Ack
>
>
> dhruba Borthakur wrote:
>
>> Hi Kareem,
>>
>> I think you have run into
>> http://issues.apache.org/jira/browse/HADOOP-2275
>>
>> I have a fix for this problem. You can find that fix as an attachment to the
>> above JIRA. If you typically build your Hadoop distribution from source, can
>> you please apply the above patch to your source and then try out your
>> experiment? If not, that's fine; this fix will probably go into the next
>> release of Hadoop.
>>
>> Thanks for your help,
>> dhruba
>>
>> -----Original Message-----
>> From: Kareem Dana [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, November 24, 2007 9:13 PM
>> To: [email protected]
>> Subject: Re: Reply: HBase PerformanceEvaluation failing
>>
>> I cleaned the logs, cleaned and reformatted the DFS. Started up DFS,
>> mapred, and hbase.
>> Ran fsck just to make sure:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
>> .....Status: HEALTHY
>> Total size: 450 B
>> Total blocks: 3 (avg. block size 150 B)
>> Total dirs: 16
>> Total files: 5
>> Over-replicated blocks: 0 (0.0 %)
>> Under-replicated blocks: 0 (0.0 %)
>> Target replication factor: 1
>> Real replication factor: 1.0
>>
>> The filesystem under path '/' is HEALTHY
>>
>> Started the PE Test:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop jar
>> /usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
>> sequentialWrite 2
>> 07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name: TestTable,
>> families: {info:={name: info, max versions: 3, compression: NONE,
>> in memory: false, max length: 2147483647, bloom filter: none}} created
>> 07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
>> 07/11/24 23:14:07 INFO mapred.JobClient:  map 0% reduce 0%
>> 07/11/24 23:20:53 INFO mapred.JobClient:  map 5% reduce 0%
>> 07/11/24 23:21:03 INFO mapred.JobClient:  map 5% reduce 1%
>> 07/11/24 23:22:35 INFO mapred.JobClient:  map 10% reduce 1%
>> 07/11/24 23:22:46 INFO mapred.JobClient:  map 10% reduce 3%
>> 07/11/24 23:24:10 INFO mapred.JobClient:  map 14% reduce 3%
>> 07/11/24 23:24:22 INFO mapred.JobClient:  map 19% reduce 3%
>> 07/11/24 23:24:34 INFO mapred.JobClient:  map 19% reduce 5%
>> 07/11/24 23:29:31 INFO mapred.JobClient:  map 24% reduce 5%
>> 07/11/24 23:29:42 INFO mapred.JobClient:  map 24% reduce 6%
>> 07/11/24 23:31:37 INFO mapred.JobClient:  map 29% reduce 6%
>> 07/11/24 23:31:45 INFO mapred.JobClient:  map 30% reduce 6%
>> 07/11/24 23:31:56 INFO mapred.JobClient:  map 30% reduce 10%
>> 07/11/24 23:32:12 INFO mapred.JobClient:  map 35% reduce 10%
>> 07/11/24 23:32:18 INFO mapred.JobClient:  map 35% reduce 11%
>> 07/11/24 23:37:33 INFO mapred.JobClient:  map 40% reduce 11%
>> 07/11/24 23:37:39 INFO mapred.JobClient:  map 40% reduce 13%
>> 07/11/24 23:37:45 INFO mapred.JobClient:  map 45% reduce 13%
>> 07/11/24 23:37:54 INFO mapred.JobClient:  map 45% reduce 15%
>> 07/11/24 23:42:02 INFO mapred.JobClient:  map 50% reduce 15%
>> 07/11/24 23:52:14 INFO mapred.JobClient:  map 100% reduce 100%
>> java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
>>     at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:585)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>
>> Checked DFS status again:
>> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
>> .....
>> /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data:
>> MISSING 1 blocks of total size 0 B.
>> .......
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data:
>> MISSING 1 blocks of total size 0 B.
>> ....................................
>> /tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data:
>> MISSING 1 blocks of total size 0 B.
>> ....................................................
>> ....................................................
>> /tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data:
>> MISSING 1 blocks of total size 0 B.
>> ...........
>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025:
>> MISSING 1 blocks of total size 0 B.
>> .Status: CORRUPT
>> Total size: 1823865158 B
>> Total blocks: 174 (avg. block size 10481983 B)
>> Total dirs: 180
>> Total files: 164
>> Over-replicated blocks: 0 (0.0 %)
>> Under-replicated blocks: 0 (0.0 %)
>> Target replication factor: 1
>> Real replication factor: 1.0
>>
>> The filesystem under path '/' is CORRUPT
>>
>> All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/
>>
>> The namenode log looks fine until the very end when it has a few of
>> these errors:
>> 2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
>> NameSystem.allocateBlock:
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data.
>> blk_-6879410450276751752
>> 2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not
>> able to place enough replicas, still in need of 1
>> 2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 54310, call
>> addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data,
>> DFSClient_-1049549396) from 172.16.6.56:56970: error: java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>> java.io.IOException: File
>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
>> could only be replicated to 0 nodes, instead of 1
>>     at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>>     at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:585)
>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>>
>> The hbase region servers seem to be having a lot of problems though,
>> according to their logs. Not quite sure what could be the cause.
>>
>> - Kareem
>>
>> On Nov 24, 2007 9:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> Can you clear the logs, clean out the file system and run the test again?
>>>
>>> The namenode logs should tell an interesting story.
>>>
>>> On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>>
>>>> I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
>>>> the PerformanceEvaluation test is corrupting it. Before I ran the
>>>> test, I ran fsck and the DFS was reported as HEALTHY. Once the PE
>>>> fails, the DFS is reported as corrupt. I tried to simplify my setup
>>>> and run the PE again. My new config is as follows:
>>>>
>>>> hadoop07 - DFS master, mapred master, hbase master
>>>> hadoop09-10 - 2 hbase region servers
>>>> hadoop11-12 - 2 datanodes, task trackers
>>>>
>>>> mapred.map.tasks = 2
>>>> mapred.reduce.tasks = 1
>>>> dfs.replication = 1
>>>>
>>>> I ran the distributed PE in that configuration and it still failed
>>>> with similar errors. The output of the hadoop fsck for this run was:
>>>>
>>>> ..........
>>>> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ......................................
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ....
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ....
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> .........
>>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
>>>> MISSING 1 blocks of total size 0 B.
>>>> ...................................
>>>> ........................................................................
>>>> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
>>>> MISSING 1 blocks of total size 0 B.
>>>> .Status: CORRUPT
>>>> Total size: 1890454330 B
>>>> Total blocks: 180 (avg. block size 10502524 B)
>>>> Total dirs: 190
>>>> Total files: 173
>>>> Over-replicated blocks: 0 (0.0 %)
>>>> Under-replicated blocks: 0 (0.0 %)
>>>> Target replication factor: 1
>>>> Real replication factor: 1.0
>>>>
>>>> The filesystem under path '/' is CORRUPT
>>>>
>>>> On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I think that stack was suggesting an HDFS fsck, not a disk-level fsck.
>>>>>
>>>>> Try [hadoop fsck /]
>>>>>
>>>>> On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I do not have root access on the xen cluster I'm using. I will ask the
>>>>>> admin to make sure the disk is working properly. Regarding the
>>>>>> mismatched versions though, are you suggesting that different region
>>>>>> servers might be running different versions of hbase/hadoop? They are
>>>>>> all running the same code from the same shared storage. There isn't
>>>>>> even another version of hadoop anywhere for the other nodes to run. I
>>>>>> think I'll try dropping my cluster down to 2 nodes and working back
>>>>>> up... maybe I can pinpoint a specific problem node. Thanks for taking
>>>>>> a look at my logs.
>>>>>>
>>>>>> On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I took a quick look, Kareem. As with the last time, hbase keeps having
>>>>>>> trouble w/ the hdfs. Things start out fine around 16:00 then go bad
>>>>>>> because it can't write reliably to the hdfs -- a variety of reasons. You
>>>>>>> then seem to restart the cluster around 17:37 or so and things seem to
>>>>>>> go along fine for a while until 19:05 when again, all regionservers
>>>>>>> report trouble writing the hdfs. Have you run an fsck?
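[Note: when fsck reports MISSING blocks like the runs above, asking it for per-file detail can show which blocks are affected and whether any datanode is recorded as holding them. Something along these lines -- the -files/-blocks/-locations switches are an assumption about what this Hadoop version supports, so check the usage message for `bin/hadoop fsck` first:

    bin/hadoop fsck /tmp/hadoop-kcd/hbase -files -blocks -locations

That narrows the report to the hbase directory and prints each file's block list together with the datanodes, if any, that are supposed to hold each block.]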
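[Note: for Dhruba's suggestion above of applying the HADOOP-2275 fix to a source build, the recipe is roughly the following sketch. The patch file name here is hypothetical (use whatever attachment is on the JIRA), and it assumes the ant-based build of the 0.15-era tree:

    cd ~/hadoop-src                    # your Hadoop source checkout
    patch -p0 < HADOOP-2275.patch      # apply the fix attached to the JIRA
    ant jar                            # rebuild the hadoop jar

Then push the rebuilt jar to every node and restart DFS before re-running the PerformanceEvaluation.]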
