Odd how the corruption is in key hbase files. For example, here is the
complaint about the hbase -ROOT- block from the namenode log:
2007-11-24 23:43:31,966 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data. blk_3947372034074506449
2007-11-24 23:43:31,966 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1
2007-11-24 23:43:31,966 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310, call addBlock(/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data, DFSClient_550313675) from 172.16.6.57:37157: error: java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
It looks like your FS goes bad all of a sudden and is no longer able to
allocate blocks (nothing in /var/log/messages on any of the HDFS nodes
58 and 59?). If you run w/ DEBUG enabled, perhaps it will give a clue
as to why it's going bad? Perhaps the HDFS experts on this list have
input on what could be going wrong? You might also try w/ a
replication-level > 1?
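(For what it's worth, bumping replication is just an override of dfs.replication
in conf/hadoop-site.xml followed by a DFS restart; the stanza below is only a
sketch, with 2 as an arbitrary example value:

<configuration>
  <!-- example only: raise default replication for newly written files -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

DEBUG output is usually enabled by raising the logger level in
conf/log4j.properties -- assuming the stock log4j setup -- and restarting the
daemons.)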
The NotServingRegionException in the regionserver logs is 'normal'. The
regionservers run fine until they run into the failure-to-replicate
complaint from HDFS.
Thanks Kareem,
St.Ack
Kareem Dana wrote:
I cleaned the logs, cleaned and reformatted the DFS. Started up DFS,
mapred, and hbase.
Ran fsck just to make sure:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
.....Status: HEALTHY
Total size: 450 B
Total blocks: 3 (avg. block size 150 B)
Total dirs: 16
Total files: 5
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0
The filesystem under path '/' is HEALTHY
Started the PE Test:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop jar /usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar sequentialWrite 2
07/11/24 23:14:03 INFO hbase.PerformanceEvaluation: Table name: TestTable, families: {info:={name: info, max versions: 3, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}} created
07/11/24 23:14:04 INFO mapred.FileInputFormat: Total input paths to process : 1
07/11/24 23:14:06 INFO mapred.JobClient: Running job: job_200711242311_0001
07/11/24 23:14:07 INFO mapred.JobClient: map 0% reduce 0%
07/11/24 23:20:53 INFO mapred.JobClient: map 5% reduce 0%
07/11/24 23:21:03 INFO mapred.JobClient: map 5% reduce 1%
07/11/24 23:22:35 INFO mapred.JobClient: map 10% reduce 1%
07/11/24 23:22:46 INFO mapred.JobClient: map 10% reduce 3%
07/11/24 23:24:10 INFO mapred.JobClient: map 14% reduce 3%
07/11/24 23:24:22 INFO mapred.JobClient: map 19% reduce 3%
07/11/24 23:24:34 INFO mapred.JobClient: map 19% reduce 5%
07/11/24 23:29:31 INFO mapred.JobClient: map 24% reduce 5%
07/11/24 23:29:42 INFO mapred.JobClient: map 24% reduce 6%
07/11/24 23:31:37 INFO mapred.JobClient: map 29% reduce 6%
07/11/24 23:31:45 INFO mapred.JobClient: map 30% reduce 6%
07/11/24 23:31:56 INFO mapred.JobClient: map 30% reduce 10%
07/11/24 23:32:12 INFO mapred.JobClient: map 35% reduce 10%
07/11/24 23:32:18 INFO mapred.JobClient: map 35% reduce 11%
07/11/24 23:37:33 INFO mapred.JobClient: map 40% reduce 11%
07/11/24 23:37:39 INFO mapred.JobClient: map 40% reduce 13%
07/11/24 23:37:45 INFO mapred.JobClient: map 45% reduce 13%
07/11/24 23:37:54 INFO mapred.JobClient: map 45% reduce 15%
07/11/24 23:42:02 INFO mapred.JobClient: map 50% reduce 15%
07/11/24 23:52:14 INFO mapred.JobClient: map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
at org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThanOne(PerformanceEvaluation.java:248)
at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:588)
at org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(PerformanceEvaluation.java:680)
at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:699)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
Checked DFS status again:
[EMAIL PROTECTED]:~/hadoop$ bin/hadoop fsck /
.....
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/3562020000090370572/data: MISSING 1 blocks of total size 0 B.
.......
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460773960343228/data: MISSING 1 blocks of total size 0 B.
....................................
/tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,8007445145595019431/info/mapfiles/5636229081894866081/data: MISSING 1 blocks of total size 0 B.
....................................................
....................................................
/tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,7501417187825603851/info/mapfiles/6826621922173844792/data: MISSING 1 blocks of total size 0 B.
...........
/tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_60020/hlog.dat.025: MISSING 1 blocks of total size 0 B.
.Status: CORRUPT
Total size: 1823865158 B
Total blocks: 174 (avg. block size 10481983 B)
Total dirs: 180
Total files: 164
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0
The filesystem under path '/' is CORRUPT
All my logs are uploaded at http://www.cs.duke.edu/~kcd/hadoop-logs2/
The namenode log looks fine until the very end when it has a few of
these errors:
2007-11-24 23:43:00,798 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data. blk_-6879410450276751752
2007-11-24 23:43:00,798 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1
2007-11-24 23:43:00,808 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310, call addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data, DFSClient_-1049549396) from 172.16.6.56:56970: error: java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/9146068477937578655/data could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
The hbase regionservers seem to be having a lot of problems, though,
according to their logs. I'm not quite sure what the cause could be.
- Kareem
On Nov 24, 2007 9:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
Can you clear the logs, clean out the file system and run the test again?
The namenode logs should tell an interesting story.
On 11/24/07 6:12 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
I ran hadoop fsck and sure enough the DFS was corrupted. It seems that
the PerformanceEvaluation test is corrupting it. Before I ran the
test, fsck reported the DFS as HEALTHY. Once the PE failed, the DFS
was reported as corrupt. I tried to simplify my setup
and run the PE again. My new config is as follows:
hadoop07 - DFS Master, Mapred master, hbase master
hadoop09-10 - 2 hbase region servers
hadoop11-12 - 2 datanodes, task trackers
mapred.map.tasks = 2
mapred.reduce.tasks = 1
dfs.replication = 1
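(Those three values are plain Hadoop config overrides; assuming they live in
the usual conf/hadoop-site.xml, the stanza would look roughly like the sketch
below:

<configuration>
  <!-- sketch of the overrides listed above; conf/hadoop-site.xml location is assumed -->
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>)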
I ran the distributed PE in that configuration and it still failed
with similar errors. The output of the hadoop fsck for this run was:
..........
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/6434881831082231493/data: MISSING 1 blocks of total size 0 B.
......................................
/tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,1227092681544002579/info/mapfiles/5263238643231358600/data: MISSING 1 blocks of total size 0 B.
....
/tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/2024298319068625138/data: MISSING 1 blocks of total size 0 B.
....
/tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,1652062411016999689/info/mapfiles/5071453667327337040/data: MISSING 1 blocks of total size 0 B.
.........
/tmp/hadoop-kcd/hbase/hregion_TestTable,13932,4738192747521322482/info/mapfiles/4400784113695734765/data: MISSING 1 blocks of total size 0 B.
...................................
........................................................................
/tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_60020/hlog.dat.027: MISSING 1 blocks of total size 0 B.
.Status: CORRUPT
Total size: 1890454330 B
Total blocks: 180 (avg. block size 10502524 B)
Total dirs: 190
Total files: 173
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0
The filesystem under path '/' is CORRUPT
On Nov 24, 2007 6:21 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
I think that stack was suggesting an HDFS fsck, not a disk level fsck.
Try [hadoop fsck /]
On 11/24/07 4:09 PM, "Kareem Dana" <[EMAIL PROTECTED]> wrote:
I do not have root access on the xen cluster I'm using. I will ask the
admin to make sure the disk is working properly. Regarding the
mismatched versions, though: are you suggesting that different region
servers might be running different versions of hbase/hadoop? They are
all running the same code from the same shared storage. There isn't
even another version of hadoop anywhere for the other nodes to run. I
think I'll try dropping my cluster down to 2 nodes and working back
up... maybe I can pinpoint a specific problem node. Thanks for taking
a look at my logs.
On Nov 24, 2007 5:49 PM, stack <[EMAIL PROTECTED]> wrote:
I took a quick look, Kareem. As with the last time, hbase keeps having
trouble w/ the hdfs. Things start out fine around 16:00, then go bad
because it can't write reliably to the hdfs -- for a variety of reasons. You
then seem to restart the cluster around 17:37 or so and things seem to
go along fine for a while until 19:05 when, again, all regionservers
report trouble writing to the hdfs. Have you run an fsck?