Looks like you have some under-replicated blocks. Does that number decrease if you run fsck multiple times?
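
If it helps, here is a minimal sketch for comparing that count between two runs. It assumes you save the text output of `hadoop fsck /` yourself. The function names `parse_fsck_summary` and `under_replicated_delta` are made up for illustration and are not part of Hadoop. The sample lines are copied from the summary quoted below.

```python
import re

# Summary lines copied from the fsck output quoted in this thread.
SAMPLE = """\
 Total blocks (validated): 40657 (avg. block size 6071743 B)
 Minimally replicated blocks: 40657 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 1410 (3.4680374 %)
 Mis-replicated blocks: 0 (0.0 %)
 Missing replicas: 2831 (2.3279145 %)
"""

# "Label: <integer>" at the start of a line; the lookahead skips
# decimal values such as the average block replication.
_LINE = re.compile(r"\s*([A-Za-z][^:]*):\s+(\d+)(?![\d.])")


def parse_fsck_summary(text):
    """Return {label: integer count} for the summary lines in fsck output."""
    counts = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            counts[m.group(1).strip()] = int(m.group(2))
    return counts


def under_replicated_delta(before, after):
    """Change in under-replicated blocks between two saved fsck outputs."""
    key = "Under-replicated blocks"
    return parse_fsck_summary(after)[key] - parse_fsck_summary(before)[key]


if __name__ == "__main__":
    counts = parse_fsck_summary(SAMPLE)
    total = counts["Total blocks (validated)"]
    under = counts["Under-replicated blocks"]
    print(f"{under} of {total} blocks under-replicated "
          f"({100.0 * under / total:.2f} %)")
```

A negative delta between two runs taken a few minutes apart would suggest the namenode is re-replicating; a flat or growing number would point at the datanodes that are down.
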

Regards,
Serge

On 5/9/12 12:23 PM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:

>On Wed, May 9, 2012 at 6:04 PM, Serge Blazhiyevskyy <
>serge.blazhiyevs...@nice.com> wrote:
>
>>
>> What does the response from fsck look like?
>>
>
>[snip lots of stuff about under replicated blocks]
>
>......Status: HEALTHY
> Total size: 246858876262 B (Total open files size: 372 B)
> Total dirs: 14914
> Total files: 39248 (Files currently being written: 4)
> Total blocks (validated): 40657 (avg. block size 6071743 B) (Total open file blocks (not validated): 4)
> Minimally replicated blocks: 40657 (100.0 %)
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 1410 (3.4680374 %)
> Mis-replicated blocks: 0 (0.0 %)
> Default replication factor: 3
> Average block replication: 2.9911454
> Corrupt blocks: 0
> Missing replicas: 2831 (2.3279145 %)
> Number of data-nodes: 5
> Number of racks: 1
>FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds
>
>Further information to add to this: it appears to be affecting 2 nodes in
>the cluster, one more than the other though. In the last couple of hours
>one of the nodes has also experienced high load. This has now dropped, but
>both of these nodes are now considered dead by the namenode. The first
>box's load is still increasing, currently 234! I think I might have to
>reboot it via IPMI.
>
>>
>> hadoop fsck /
>>
>> It might be the case that some of the blocks are misreplicated
>>
>> Serge
>>
>> Hadoopway.blogspot.com
>>
>> On 5/9/12 9:58 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>>
>> >On Wed, May 9, 2012 at 5:56 PM, Serge Blazhiyevskyy <
>> >serge.blazhiyevs...@nice.com> wrote:
>> >
>> >> Take a look at your data distribution for that cluster. Maybe it is
>> >> unbalanced.
>> >>
>> >> Run balancer, if it is.
>> >>
>> >
>> >The cluster is balanced, I ran balancer yesterday. Oddly enough the
>> >problem started after I had run the balancer.
>> >
>> >I'm running CDH3 btw.
>> >
>> >>
>> >> Regards,
>> >> Serge
>> >>
>> >> hadoopway.blogspot.com
>> >>
>> >> On 5/9/12 9:52 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>> >>
>> >> >Hi,
>> >> >
>> >> >I wonder if someone could give some pointers with a problem I'm having?
>> >> >
>> >> >I have a 7 machine cluster set up for testing and we have been pouring
>> >> >data into it for a week without issue. We have learnt several things
>> >> >along the way and solved all the problems up to now by searching online,
>> >> >but now I'm stuck. One of the data nodes decided to have a load of 70+
>> >> >this morning. Stopping datanode and tasktracker brought it back to
>> >> >normal, but every time I start the datanode again the load shoots
>> >> >through the roof, and all I get in the logs is:
>> >> >
>> >> >STARTUP_MSG: Starting DataNode
>> >> >STARTUP_MSG: host = pl464/10.20.16.64
>> >> >STARTUP_MSG: args = []
>> >> >STARTUP_MSG: version = 0.20.2-cdh3u3
>> >> >STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>> >> >-************************************************************/
>> >> >
>> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >
>> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >
>> >> >Nothing else.
>> >> >
>> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>> >> >*very* unresponsive.
>> >> >
>> >> >Anybody got any pointers on things I can try?
>> >> >
>> >> >Thanks
>> >> >Darrell.