Darrell

Are the new dn, nn, and mapred directories on the same physical disk? Nothing is on 
NFS, correct?
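
One way to check the disk and NFS question is to look up which mount backs each directory. The following is only a minimal Python sketch, not part of any Hadoop tooling: the function names find_mount and report and the example directory paths are made up for illustration, and it assumes a Linux box that exposes /proc/mounts. Two directories sharing a device is only a hint that they share a physical disk, since LVM or RAID can hide the real layout.

```python
import os


def find_mount(path, mounts_text):
    """Return (device, mount_point, fs_type) for the mount holding `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins; on a tie, the later entry shadows the earlier one.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[:3]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fs_type)
    return best


def report(dirs, mounts_text):
    """Map each directory to (mount, is_nfs)."""
    result = {}
    for d in dirs:
        mount = find_mount(d, mounts_text)
        is_nfs = mount is not None and mount[2].startswith("nfs")
        result[d] = (mount, is_nfs)
    return result


# Example use on a live node. The directory names are hypothetical;
# substitute the values of dfs.name.dir, dfs.data.dir and
# mapred.local.dir from your own configuration:
#
#   with open("/proc/mounts") as f:
#       print(report(["/data/1/dfs/nn", "/data/1/dfs/dn"], f.read()))
```

If two of the directories report the same device, or any of them reports an nfs file system type, that would be worth ruling out before digging into the JVM.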

Could you be having some hardware issue? Any clue in /var/log/messages or dmesg?

An unresponsive system indicates a CPU that is either very busy doing 
something or waiting for something (I/O, for example), and the fact that it 
happens only on some nodes points to a local problem.

Raj



>________________________________
> From: Darrell Taylor <darrell.tay...@gmail.com>
>To: common-user@hadoop.apache.org 
>Cc: Raj Vishwanathan <rajv...@yahoo.com> 
>Sent: Thursday, May 10, 2012 3:57 AM
>Subject: Re: High load on datanode startup
> 
>On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> That's real weird..
>>
>> If you can reproduce this after a reboot, I'd recommend letting the DN
>> run for a minute, and then capturing a "jstack <pid of dn>" as well as
>> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
>
>
>What I did after the reboot this morning was to move my dn, nn, and
>mapred directories out of the way, create new ones, format them, and
>restart the node; it's now happy.
>
>I'll try moving the directories back later and do the jstack as you suggest.
>
>
>>
>> What JVM/JDK are you using? What OS version?
>>
>
>root@pl446:/# dpkg --get-selections | grep java
>java-common                                     install
>libjaxp1.3-java                                 install
>libjaxp1.3-java-gcj                             install
>libmysql-java                                   install
>libxerces2-java                                 install
>libxerces2-java-gcj                             install
>sun-java6-bin                                   install
>sun-java6-javadb                                install
>sun-java6-jdk                                   install
>sun-java6-jre                                   install
>
>root@pl446:/# java -version
>java version "1.6.0_26"
>Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>
>root@pl446:/# cat /etc/issue
>Debian GNU/Linux 6.0 \n \l
>
>
>
>>
>> -Todd
>>
>>
>> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
>> <darrell.tay...@gmail.com> wrote:
>> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>> >
>> >> The picture is either too small or too pixelated for my eyes :-)
>> >>
>> >
>> > There should be a zoom option in the top right of the page that allows
>> > you to view it full size.
>> >
>> >
>> >>
>> >> Can you log in to the box and send the output of top? If the system is
>> >> unresponsive, it has to be something more than an unbalanced HDFS
>> >> cluster, methinks.
>> >>
>> >
>> > Sorry, I'm unable to log in to the box; it's completely unresponsive.
>> >
>> >
>> >>
>> >> Raj
>> >>
>> >>
>> >>
>> >> >________________________________
>> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>> >> >To: common-user@hadoop.apache.org; Raj Vishwanathan <rajv...@yahoo.com>
>> >> >Sent: Wednesday, May 9, 2012 2:40 PM
>> >> >Subject: Re: High load on datanode startup
>> >> >
>> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>> >> >
>> >> >> When you say 'load', what do you mean? CPU load or something else?
>> >> >>
>> >> >
>> >> >I mean in the unix sense of load average, i.e. top would show a load of
>> >> >(currently) 376.
>> >> >
>> >> >Looking at the Ganglia stats for the box, it's not CPU load as such: the
>> >> >graphs show actual CPU usage at 30%, but the number of running processes is
>> >> >simply growing linearly. Screenshot of the Ganglia page here:
>> >> >
>> >> >
>> >>
>> https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> Raj
>> >> >>
>> >> >>
>> >> >>
>> >> >> >________________________________
>> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>> >> >> >To: common-user@hadoop.apache.org
>> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
>> >> >> >Subject: High load on datanode startup
>> >> >> >
>> >> >> >Hi,
>> >> >> >
>> >> >> >I wonder if someone could give some pointers with a problem I'm having?
>> >> >> >
>> >> >> >I have a 7-machine cluster set up for testing and we have been pouring data
>> >> >> >into it for a week without issue. We have learnt several things along the way
>> >> >> >and solved all the problems up to now by searching online, but now I'm
>> >> >> >stuck. One of the data nodes decided to have a load of 70+ this morning;
>> >> >> >stopping the datanode and tasktracker brought it back to normal, but every
>> >> >> >time I start the datanode again the load shoots through the roof, and all I
>> >> >> >get in the logs is:
>> >> >> >
>> >> >> >STARTUP_MSG: Starting DataNode
>> >> >> >STARTUP_MSG:   host = pl464/10.20.16.64
>> >> >> >STARTUP_MSG:   args = []
>> >> >> >STARTUP_MSG:   version = 0.20.2-cdh3u3
>> >> >> >STARTUP_MSG:   build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>> >> >> >-************************************************************/
>> >> >> >
>> >> >> >
>> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >Nothing else.
>> >> >> >
>> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>> >> >> >*very* unresponsive.
>> >> >> >
>> >> >> >Anybody got any pointers of things I can try?
>> >> >> >
>> >> >> >Thanks
>> >> >> >Darrell.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> >
>> >>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
