Darrell

Are the new dn, nn, and mapred directories on the same physical disk? Nothing on NFS, correct?
Could you be having some hardware issue? Any clue in /var/log/messages or dmesg?

A non-responsive system indicates a CPU that is really busy, either doing something or waiting for something, and the fact that it happens only on some nodes indicates a local problem.

Raj

>________________________________
> From: Darrell Taylor <darrell.tay...@gmail.com>
>To: common-user@hadoop.apache.org
>Cc: Raj Vishwanathan <rajv...@yahoo.com>
>Sent: Thursday, May 10, 2012 3:57 AM
>Subject: Re: High load on datanode startup
>
>On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> That's real weird..
>>
>> If you can reproduce this after a reboot, I'd recommend letting the DN
>> run for a minute, and then capturing a "jstack <pid of dn>" as well as
>> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
>
>What I did after the reboot this morning was to move my dn, nn, and
>mapred directories out of the way, create a new one, format it, and
>restart the node; it's now happy.
>
>I'll try moving the directories back later and do the jstack as you suggest.
>
>>
>> What JVM/JDK are you using? What OS version?
>>
>
>root@pl446:/# dpkg --get-selections | grep java
>java-common install
>libjaxp1.3-java install
>libjaxp1.3-java-gcj install
>libmysql-java install
>libxerces2-java install
>libxerces2-java-gcj install
>sun-java6-bin install
>sun-java6-javadb install
>sun-java6-jdk install
>sun-java6-jre install
>
>root@pl446:/# java -version
>java version "1.6.0_26"
>Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>
>root@pl446:/# cat /etc/issue
>Debian GNU/Linux 6.0 \n \l
>
>>
>> -Todd
>>
>>
>> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor <darrell.tay...@gmail.com> wrote:
>> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>> >
>> >> The picture is either too small or too pixelated for my eyes :-)
>> >>
>> >
>> > There should be a zoom option in the top right of the page that allows you
>> > to view it full size.
>> >
>> >>
>> >> Can you login to the box and send the output of top? If the system is
>> >> unresponsive, it has to be something more than an unbalanced hdfs cluster,
>> >> methinks.
>> >>
>> >
>> > Sorry, I'm unable to login to the box, it's completely unresponsive.
>> >
>> >>
>> >> Raj
>> >>
>> >> >________________________________
>> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>> >> >To: common-user@hadoop.apache.org; Raj Vishwanathan <rajv...@yahoo.com>
>> >> >Sent: Wednesday, May 9, 2012 2:40 PM
>> >> >Subject: Re: High load on datanode startup
>> >> >
>> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>> >> >
>> >> >> When you say 'load', what do you mean? CPU load or something else?
>> >> >>
>> >> >
>> >> >I mean in the unix sense of load average, i.e. top would show a load of
>> >> >(currently) 376.
>> >> >
>> >> >Looking at Ganglia stats for the box it's not CPU load as such, the graphs
>> >> >show actual CPU usage as 30%, but the number of running processes is
>> >> >simply growing in a linear manner - screen shot of ganglia page here :
>> >> >
>> >> >https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
>> >> >
>> >> >>
>> >> >> Raj
>> >> >>
>> >> >> >________________________________
>> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>> >> >> >To: common-user@hadoop.apache.org
>> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
>> >> >> >Subject: High load on datanode startup
>> >> >> >
>> >> >> >Hi,
>> >> >> >
>> >> >> >I wonder if someone could give some pointers with a problem I'm having?
>> >> >> >
>> >> >> >I have a 7 machine cluster set up for testing and we have been pouring data
>> >> >> >into it for a week without issue, have learnt several things along the way
>> >> >> >and solved all the problems up to now by searching online, but now I'm
>> >> >> >stuck. One of the data nodes decided to have a load of 70+ this morning,
>> >> >> >stopping datanode and tasktracker brought it back to normal, but every time
>> >> >> >I start the datanode again the load shoots through the roof, and all I get
>> >> >> >in the logs is :
>> >> >> >
>> >> >> >STARTUP_MSG: Starting DataNode
>> >> >> >STARTUP_MSG: host = pl464/10.20.16.64
>> >> >> >STARTUP_MSG: args = []
>> >> >> >STARTUP_MSG: version = 0.20.2-cdh3u3
>> >> >> >STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>> >> >> >-************************************************************/
>> >> >> >
>> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >Nothing else.
>> >> >> >
>> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>> >> >> >*very* unresponsive
>> >> >> >
>> >> >> >Anybody got any pointers of things I can try?
>> >> >> >
>> >> >> >Thanks
>> >> >> >Darrell.
>> >> >> >
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
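
A load average of 376 alongside roughly 30% CPU usage, as reported earlier in the thread, is consistent with many tasks waiting rather than computing: on Linux the load average counts tasks in uninterruptible sleep (state D, typically blocked on I/O) as well as runnable ones. As a minimal sketch, and not something anyone in the thread ran, the script below tallies threads by scheduler state from /proc. The helper names task_state and count_states are hypothetical, introduced here only for illustration, and the script assumes a Linux /proc layout.

```python
import os
from collections import Counter


def task_state(stat_line):
    """Return the one-letter scheduler state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so take the first field after the last ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count threads per state (R = running, D = uninterruptible, S = sleeping, ...)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    counts[task_state(f.read())] += 1
            except (OSError, IndexError):
                continue  # thread vanished or line was malformed
    return counts


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print("loadavg:", f.read().strip())
    for state, n in sorted(count_states().items()):
        print(state, n)
```

If the D count is large and growing while R stays small, that would point toward blocked I/O (a failing disk or a hung mount) rather than CPU-bound work, which fits the questions above about dmesg, /var/log/messages, and NFS.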