Hi Kareem,

I think you have run into 
  http://issues.apache.org/jira/browse/HADOOP-2275

I have a fix for this problem. You can find it as an attachment to the
above JIRA. If you typically build your Hadoop distribution from source,
could you please apply the patch to your source tree and then try your
experiment again? If not, that's fine; the fix will probably go into the
next Hadoop release.
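
In case it is useful, the rebuild would look roughly like this (just a
sketch; the patch file name and the Ant target are my assumptions, so use
whatever the JIRA attachment and your build are actually called):

  # From the top of the Hadoop source tree; HADOOP-2275.patch is a
  # placeholder for the attachment on the JIRA.
  cd ~/hadoop
  patch -p0 < HADOOP-2275.patch

  # Rebuild with Ant and redeploy the resulting jars to the nodes before
  # rerunning the PerformanceEvaluation test.
  ant clean jar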

Thanks for your help,
dhruba

-----Original Message-----
From: Kareem Dana [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 24, 2007 9:13 PM
To: [email protected]
Subject: Re: Reply: HBase PerformanceEvaluation failing

I cleaned the logs, then cleaned and reformatted the DFS. I started up
DFS, mapred, and hbase, and ran fsck just to make sure:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
.....Status: HEALTHY
 Total size:    450 B
 Total blocks:  3 (avg. block size 150 B)
 Total dirs:    16
 Total files:   5
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     1
 Real replication factor:       1.0


The filesystem under path '/' is HEALTHY

Started the PE Test:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop jar
/usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
sequentialWrite 2
07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name:
TestTable, families: {info:={name: info, max versions: 3, compression:
NONE, in memory: false, max length: 2147483647, bloom filter: none}}
created
07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
07/11/24 23:14:07 INFO mapred.JobClient:  map 0% reduce 0%
07/11/24 23:20:53 INFO mapred.JobClient:  map 5% reduce 0%
07/11/24 23:21:03 INFO mapred.JobClient:  map 5% reduce 1%
07/11/24 23:22:35 INFO mapred.JobClient:  map 10% reduce 1%
07/11/24 23:22:46 INFO mapred.JobClient:  map 10% reduce 3%
07/11/24 23:24:10 INFO mapred.JobClient:  map 14% reduce 3%
07/11/24 23:24:22 INFO mapred.JobClient:  map 19% reduce 3%
07/11/24 23:24:34 INFO mapred.JobClient:  map 19% reduce 5%
07/11/24 23:29:31 INFO mapred.JobClient:  map 24% reduce 5%
07/11/24 23:29:42 INFO mapred.JobClient:  map 24% reduce 6%
07/11/24 23:31:37 INFO mapred.JobClient:  map 29% reduce 6%
07/11/24 23:31:45 INFO mapred.JobClient:  map 30% reduce 6%
07/11/24 23:31:56 INFO mapred.JobClient:  map 30% reduce 10%
07/11/24 23:32:12 INFO mapred.JobClient:  map 35% reduce 10%
07/11/24 23:32:18 INFO mapred.JobClient:  map 35% reduce 11%
07/11/24 23:37:33 INFO mapred.JobClient:  map 40% reduce 11%
07/11/24 23:37:39 INFO mapred.JobClient:  map 40% reduce 13%
07/11/24 23:37:45 INFO mapred.JobClient:  map 45% reduce 13%
07/11/24 23:37:54 INFO mapred.JobClient:  map 45% reduce 15%
07/11/24 23:42:02 INFO mapred.JobClient:  map 50% reduce 15%
07/11/24 23:52:14 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at 
org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
        at 
org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
        at 
org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
        at 
org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)


Checked DFS status again:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
.....
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data:
MISSING 1 blocks of total size 0 B.
.......
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data:
MISSING 1 blocks of total size 0 B.
....................................
/tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data:
MISSING 1 blocks of total size 0 B.
....................................................
....................................................
/tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data:
MISSING 1 blocks of total size 0 B.
...........
/tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025:
MISSING 1 blocks of total size 0 B.
.Status: CORRUPT
 Total size:    1823865158 B
 Total blocks:  174 (avg. block size 10481983 B)
 Total dirs:    180
 Total files:   164
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     1
 Real replication factor:       1.0


The filesystem under path '/' is CORRUPT
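
In case it helps, a more detailed fsck pass (using the standard fsck
flags, as I understand them) should list exactly which files lost blocks
and where any remaining replicas live:

  # More verbose fsck over the hbase directory to see the affected files
  # and the blocks/locations behind the MISSING reports.
  bin/hadoop fsck /tmp/hadoop-kcd/hbase -files -blocks -locations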

All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/

The namenode log looks fine until the very end, where it shows a few of
these errors:
2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data.
blk_-6879410450276751752
2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not
able to place enough replicas, still in need of 1
2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54310, call
addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data,
DFSClient_-1049549396) from 172.16.6.56:56970: error:
java.io.IOException: File
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
could only be replicated to 0 nodes, instead of 1
java.io.IOException: File
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data
could only be replicated to 0 nodes, instead of 1
        at 
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

The hbase region servers seem to be having a lot of problems, though,
according to their logs. I'm not quite sure what the cause could be.
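
In case it helps anyone looking at them, a quick grep over the region
server logs pulls the warnings and exceptions together (the file pattern
below is just a guess at how the logs are named on my nodes):

  # Collect WARN/ERROR lines and exceptions from the region server logs.
  grep -n "WARN\|ERROR\|Exception" logs/*regionserver*.log | less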

- Kareem

On Nov 24, 2007 9:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Can you clear the logs, clean out the file system and run the test again?
>
> The namenode logs should tell an interesting story.
>
>
>
> On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>
> > I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
> > the PerformanceEvaluation test is corrupting it. Before I ran the
> > test, fsck reported the DFS as HEALTHY. Once the PE failed, the DFS
> > was reported as corrupt. I tried to simplify my setup and run the PE
> > again. My new config is as follows:
> >
> > hadoop07 - DFS Master, Mapred master, hbase master
> > hadoop09-10 - 2 hbase region servers
> > hadoop11-12 - 2 datanodes, task trackers
> >
> > mapred.map.tasks = 2
> > mapred.reduce.tasks = 1
> > dfs.replication = 1
> >
> > I ran the distributed PE in that configuration and it still failed
> > with similar errors. The output of the hadoop fsck for this run was:
> >
> > ..........
> > /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
> > MISSING 1 blocks of total size 0 B.
> > ......................................
> > /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
> > MISSING 1 blocks of total size 0 B.
> > ....
> > /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
> > MISSING 1 blocks of total size 0 B.
> > ....
> > /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
> > MISSING 1 blocks of total size 0 B.
> > .........
> > /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
> > MISSING 1 blocks of total size 0 B.
> > ...................................
> > ........................................................................
> > /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
> > MISSING 1 blocks of total size 0 B.
> > .Status: CORRUPT
> >  Total size:    1890454330 B
> >  Total blocks:  180 (avg. block size 10502524 B)
> >  Total dirs:    190
> >  Total files:   173
> >  Over-replicated blocks:        0 (0.0 %)
> >  Under-replicated blocks:       0 (0.0 %)
> >  Target replication factor:     1
> >  Real replication factor:       1.0
> >
> >
> > The filesystem under path '/' is CORRUPT
> >
> >
> > On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >> I think that stack was suggesting an HDFS fsck, not a disk level fsck.
> >>
> >> Try [hadoop fsck /]
> >>
> >>
> >>
> >>
> >>
> >> On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
> >>
> >>> I do not have root access on the xen cluster I'm using. I will ask the
> >>> admin to make sure the disk is working properly. Regarding the
> >>> mismatch versions though, are you suggesting that different region
> >>> servers might be running different versions of hbase/hadoop? They are
> >>> all running the same code from the same shared storage. There isn't
> >>> even another version of hadoop anywhere for the other nodes to run. I
> >>> think I'll try dropping my cluster down to 2 nodes and working back
> >>> up... maybe I can pinpoint a specific problem node. Thanks for taking
> >>> a look at my logs.
> >>>
> >>> On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
> >>>> I took a quick look, Kareem. As with last time, hbase keeps having
> >>>> trouble with the hdfs. Things start out fine around 16:00, then go
> >>>> bad because it can't write reliably to the hdfs, for a variety of
> >>>> reasons. You then seem to restart the cluster around 17:37 or so,
> >>>> and things go along fine for a while until 19:05 when, again, all
> >>>> regionservers report trouble writing to the hdfs. Have you run an fsck?
> >>
> >>
>
>
