I would wait for that number to go down to 0. That could be a reason for your CPU utilization.
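If you'd rather not re-run fsck by hand, a rough sketch like the one below could poll it for you. It assumes the `hadoop` command is on the PATH, that the summary is printed to stdout with an "Under-replicated blocks:" line like the one in your output, and that Python 3.7+ is available. The function names and the polling interval are made up for illustration and are not part of Hadoop.

```python
import re
import subprocess
import time

# Matches the fsck summary line, e.g.
#   " Under-replicated blocks: 353 (0.86001074 %)"
UNDER_REPLICATED = re.compile(r"Under-replicated blocks:\s*(\d+)")


def parse_under_replicated(fsck_output):
    """Return the under-replicated block count, or None if the line is absent."""
    match = UNDER_REPLICATED.search(fsck_output)
    return int(match.group(1)) if match else None


def run_fsck(path="/"):
    """Run `hadoop fsck <path>` and return its stdout as text."""
    result = subprocess.run(["hadoop", "fsck", path], capture_output=True, text=True)
    return result.stdout


def wait_for_replication(fetch=run_fsck, interval_seconds=60, max_checks=120):
    """Poll fsck until the count reaches 0 or max_checks is hit; return the last count."""
    count = None
    for _ in range(max_checks):
        count = parse_under_replicated(fetch())
        print("under-replicated blocks:", count)
        if count == 0:
            break
        time.sleep(interval_seconds)
    return count
```

Calling `wait_for_replication()` with the defaults checks once a minute for up to two hours. If the count stops falling well before it reaches 0, that is probably worth reporting back to the list.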
Regards,
Serge

On 5/9/12 2:27 PM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:

> On Wed, May 9, 2012 at 10:00 PM, Serge Blazhiyevskyy
> <serge.blazhiyevs...@nice.com> wrote:
>
>> Looks like you have some under-replicated blocks. Does that number
>> decrease if you run fsck multiple times?
>
> Yes, since my last post it's now down to 353....
>
> Status: HEALTHY
>  Total size: 246983628437 B (Total open files size: 372 B)
>  Total dirs: 15172
>  Total files: 39637 (Files currently being written: 7)
>  Total blocks (validated): 41046 (avg. block size 6017239 B) (Total open file blocks (not validated): 6)
>  Minimally replicated blocks: 41046 (100.0 %)
>  Over-replicated blocks: 0 (0.0 %)
>  Under-replicated blocks: 353 (0.86001074 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor: 3
>  Average block replication: 3.016981
>  Corrupt blocks: 0
>  Missing replicas: 1774 (1.4325514 %)
>  Number of data-nodes: 5
>  Number of racks: 1
> FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds
>
>> Regards,
>> Serge
>>
>> On 5/9/12 12:23 PM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>>
>>> On Wed, May 9, 2012 at 6:04 PM, Serge Blazhiyevskyy
>>> <serge.blazhiyevs...@nice.com> wrote:
>>>
>>>> What does the response from fsck look like?
>>>
>>> [snip lots of stuff about under replicated blocks]
>>>
>>> ......Status: HEALTHY
>>>  Total size: 246858876262 B (Total open files size: 372 B)
>>>  Total dirs: 14914
>>>  Total files: 39248 (Files currently being written: 4)
>>>  Total blocks (validated): 40657 (avg. block size 6071743 B) (Total open file blocks (not validated): 4)
>>>  Minimally replicated blocks: 40657 (100.0 %)
>>>  Over-replicated blocks: 0 (0.0 %)
>>>  Under-replicated blocks: 1410 (3.4680374 %)
>>>  Mis-replicated blocks: 0 (0.0 %)
>>>  Default replication factor: 3
>>>  Average block replication: 2.9911454
>>>  Corrupt blocks: 0
>>>  Missing replicas: 2831 (2.3279145 %)
>>>  Number of data-nodes: 5
>>>  Number of racks: 1
>>> FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds
>>>
>>> Further information to add to this: it appears to be affecting 2 nodes
>>> in the cluster, one more than the other though. In the last couple of
>>> hours one of the nodes has also experienced high load. This has now
>>> dropped, but both of these nodes are now considered dead by the
>>> namenode. The first box's load is still increasing, currently 234! I
>>> think I might have to reboot it via IPMI.
>>>
>>>> hadoop fsck /
>>>>
>>>> It might be the case that some of the blocks are misreplicated
>>>>
>>>> Serge
>>>>
>>>> Hadoopway.blogspot.com
>>>>
>>>> On 5/9/12 9:58 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>>>>
>>>>> On Wed, May 9, 2012 at 5:56 PM, Serge Blazhiyevskyy
>>>>> <serge.blazhiyevs...@nice.com> wrote:
>>>>>
>>>>>> Take a look at your data distribution for that cluster. Maybe it
>>>>>> is unbalanced.
>>>>>>
>>>>>> Run balancer, if it is.
>>>>>
>>>>> The cluster is balanced, I ran balancer yesterday. Oddly enough the
>>>>> problem started after I had run the balancer.
>>>>>
>>>>> I'm running CDH3 btw.
>>>>>
>>>>>> Regards,
>>>>>> Serge
>>>>>>
>>>>>> hadoopway.blogspot.com
>>>>>>
>>>>>> On 5/9/12 9:52 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I wonder if someone could give some pointers with a problem I'm
>>>>>>> having?
>>>>>>>
>>>>>>> I have a 7 machine cluster set up for testing and we have been
>>>>>>> pouring data into it for a week without issue. We have learnt
>>>>>>> several things along the way and solved all the problems up to
>>>>>>> now by searching online, but now I'm stuck. One of the data nodes
>>>>>>> decided to have a load of 70+ this morning. Stopping datanode and
>>>>>>> tasktracker brought it back to normal, but every time I start the
>>>>>>> datanode again the load shoots through the roof, and all I get in
>>>>>>> the logs is:
>>>>>>>
>>>>>>> STARTUP_MSG: Starting DataNode
>>>>>>> STARTUP_MSG:   host = pl464/10.20.16.64
>>>>>>> STARTUP_MSG:   args = []
>>>>>>> STARTUP_MSG:   version = 0.20.2-cdh3u3
>>>>>>> STARTUP_MSG:   build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>>>>>>> -************************************************************/
>>>>>>>
>>>>>>> 2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>>
>>>>>>> 2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>>
>>>>>>> Nothing else.
>>>>>>>
>>>>>>> The load seems to max out only 1 of the CPUs, but the machine
>>>>>>> becomes *very* unresponsive.
>>>>>>>
>>>>>>> Anybody got any pointers of things I can try?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Darrell.