Can you clear the logs, clean out the file system and run the test again? The namenode logs should tell an interesting story.
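
In case it helps when comparing the fsck results from one run to the next, here is a rough Python sketch that pulls the damaged paths and the overall status out of saved `hadoop fsck /` output. It assumes the text looks like the output quoted below (a path, a colon, then "MISSING n blocks of total size s B.", and a "Status:" line), and that any paths wrapped by the mail client have been rejoined first. The name `summarize_fsck` is made up for illustration; it is not part of Hadoop.

```python
import re

# A damaged file is reported as "<path>:" followed by the MISSING summary,
# either on the same line or on the next one (assumed from the quoted output).
MISSING_RE = re.compile(
    r"(/\S+?)\s*:\s+MISSING (\d+) blocks of total size (\d+) B"
)
# Overall verdict, e.g. "Status: CORRUPT" or "Status: HEALTHY".
STATUS_RE = re.compile(r"Status:\s*(\w+)")


def summarize_fsck(text):
    """Return (status, missing) parsed from saved fsck output.

    status  -- the word after "Status:", or None if no such line exists
    missing -- list of (path, missing_blocks, total_size_bytes) tuples
    """
    missing = [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in MISSING_RE.finditer(text)
    ]
    status = STATUS_RE.search(text)
    return (status.group(1) if status else None), missing
```

Running it on the output saved before and after the PE and comparing the two `missing` lists should show which files went bad during the test.
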

On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:

> I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
> the PerformanceEvaluation test is corrupting it. Before I run the
> test, I ran fsck and the DFS was reported as HEALTHY. Once the PE
> fails, the DFS is reported as corrupt. I tried to simplify my setup
> and run the PE again. My new config is as follows:
>
> hadoop07 - DFS Master, Mapred master, hbase master
> hadoop09-10 - 2 hbase region servers
> hadoop11-12 - 2 datanodes, task trackers
>
> mapred.map.tasks = 2
> mapred.reduce.tasks = 1
> dfs.replication = 1
>
> I ran the distributed PE in that configuration and it still failed
> with similar errors. The output of the hadoop fsck for this run was:
>
> ..........
> /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data:
> MISSING 1 blocks of total size 0 B.
> ......................................
> /tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data:
> MISSING 1 blocks of total size 0 B.
> ....
> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data:
> MISSING 1 blocks of total size 0 B.
> ....
> /tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data:
> MISSING 1 blocks of total size 0 B.
> .........
> /tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data:
> MISSING 1 blocks of total size 0 B.
> ...................................
> ........................................................................
> /tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027:
> MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
> Total size: 1890454330 B
> Total blocks: 180 (avg. block size 10502524 B)
> Total dirs: 190
> Total files: 173
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 0 (0.0 %)
> Target replication factor: 1
> Real replication factor: 1.0
>
>
> The filesystem under path '/' is CORRUPT
>
>
> On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> I think that stack was suggesting an HDFS fsck, not a disk level fsck.
>>
>> Try [hadoop fsck /]
>>
>>
>> On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
>>
>>> I do not have root access on the xen cluster I'm using. I will ask the
>>> admin to make sure the disk is working properly. Regarding the
>>> mismatch versions though, are you suggesting that different region
>>> servers might be running different versions of hbase/hadoop? They are
>>> all running the same code from the same shared storage. There isn't
>>> even another version of hadoop anywhere for the other nodes to run. I
>>> think I'll try dropping my cluster down to 2 nodes and working back
>>> up... maybe I can pin point a specific problem node. Thanks for taking
>>> a look at my logs.
>>>
>>> On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
>>>> I took a quick look Kareem. As with the last time, hbase keeps having
>>>> trouble w/ the hdfs. Things start out fine around 16:00 then go bad
>>>> because can't write reliably to the hdfs -- a variety of reasons. You
>>>> then seem to restart the cluster around 17:37 or so and things seem to
>>>> go along fine for a while until 19:05 when again, all regionservers
>>>> report trouble writing the hdfs. Have you run an fsck?
>>
>>
