Re: High load on datanode startup

Harsh J Fri, 11 May 2012 03:37:43 -0700

It doesn't look like the $HBASE_HOME/bin/hbase script runs
"$HADOOP_HOME/bin/hadoop classpath" directly. Its classpath builder
seems to add the $HADOOP_HOME items manually, via directory listings
and the like. However, if hbase-env.sh sets an HBASE_CLASSPATH that
calls `hadoop classpath`, and hadoop-env.sh in turn calls
`hbase classpath`, this issue could happen.
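
To illustrate the loop described above, here is a minimal sketch of a
guard one could put in hadoop-env.sh, assuming that file is what calls
`hbase classpath`. HBASE_CP_GUARD and add_hbase_classpath are invented
names for illustration, not existing Hadoop or HBase settings, and I
have not tried this against a real install.

```shell
# Sketch of a recursion guard for hadoop-env.sh (hypothetical names).
# The idea: mark the environment before calling `hbase classpath`, so
# that if hbase calls back into hadoop (which sources this file again),
# the nested invocation sees the marker and skips the call.
add_hbase_classpath() {
  # Already inside a nested call: do nothing, which breaks the loop.
  if [ -n "${HBASE_CP_GUARD:-}" ]; then
    return 0
  fi
  # Only attempt this when an hbase command is actually on the PATH.
  if command -v hbase >/dev/null 2>&1; then
    # The guard is passed only to the child process environment.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$(HBASE_CP_GUARD=1 hbase classpath)"
    export HADOOP_CLASSPATH
  fi
}
```

Calling add_hbase_classpath at the end of hadoop-env.sh would then
append the HBase classpath at most once per top-level invocation,
instead of spawning a new `hbase classpath` at every level.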


I do know that `hbase classpath` can take a very long time and/or hang
on network calls if there's a target/build directory inside of
$HBASE_HOME, which causes it to use maven to generate a classpath
instead of using a cached file or generating it locally. Running
mvn clean generally clears that up for me whenever it happens on my
installs.
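
A quick way to check for that situation, as a rough sketch: the
has_build_dir helper and the /usr/lib/hbase fallback path are
assumptions for illustration, and it only looks for a directory named
target (the one mvn clean removes).

```shell
# Return success if the given HBase directory contains a maven
# "target" build directory. Hypothetical helper, not part of HBase.
has_build_dir() {
  [ -d "$1/target" ]
}

# Fallback path is an assumption; set HBASE_HOME for your own install.
HBASE_HOME="${HBASE_HOME:-/usr/lib/hbase}"

if has_build_dir "$HBASE_HOME"; then
  echo "target/ found under $HBASE_HOME; 'hbase classpath' may go through maven"
  echo "consider running: (cd \"$HBASE_HOME\" && mvn clean)"
else
  echo "no target/ directory under $HBASE_HOME"
fi
```

The script only reports; it leaves running mvn clean to you, since
that deletes build output.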

On Fri, May 11, 2012 at 3:02 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Fri, May 11, 2012 at 2:29 AM, Darrell Taylor
> <darrell.tay...@gmail.com> wrote:
>>
>> What I saw on the machine was thousands of recursive processes in ps of the
>> form 'bash /usr/bin/hbase classpath...',  Stopping everything didn't clean
>> the processes up so had to kill them manually with some grep/xargs foo.
>>  Once this was all cleaned up and the hadoop-env.sh file removed the nodes
>> seem to be happy again.
>
> Ah -- maybe the issue is that... my guess is that "hbase classpath" is
> now trying to include the Hadoop dependencies using "hadoop
> classpath". But "hadoop classpath" was recursing right back because of
> that setting in hadoop-env. Basically you made a fork bomb - that
> explains the shape of the graph in Ganglia perfectly.
>
> -Todd
>
>>
>> Darrell.
>>
>>
>>>
>>> Raj
>>>
>>>
>>>
>>> >________________________________
>>> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >To: common-user@hadoop.apache.org
>>> >Cc: Raj Vishwanathan <rajv...@yahoo.com>
>>> >Sent: Thursday, May 10, 2012 3:57 AM
>>> >Subject: Re: High load on datanode startup
>>> >
>>> >On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>> >
>>> >> That's real weird..
>>> >>
>>> >> If you can reproduce this after a reboot, I'd recommend letting the DN
>>> >> run for a minute, and then capturing a "jstack <pid of dn>" as well as
>>> >> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
>>> >
>>> >
>>> >What I did after the reboot this morning was to move my dn, nn, and
>>> >mapred directories out of the way, create a new one, format it, and
>>> >restart the node; it's now happy.
>>> >
>>> >I'll try moving the directories back later and do the jstack as you
>>> suggest.
>>> >
>>> >
>>> >>
>>> >> What JVM/JDK are you using? What OS version?
>>> >>
>>> >
>>> >root@pl446:/# dpkg --get-selections | grep java
>>> >java-common                                     install
>>> >libjaxp1.3-java                                 install
>>> >libjaxp1.3-java-gcj                             install
>>> >libmysql-java                                   install
>>> >libxerces2-java                                 install
>>> >libxerces2-java-gcj                             install
>>> >sun-java6-bin                                   install
>>> >sun-java6-javadb                                install
>>> >sun-java6-jdk                                   install
>>> >sun-java6-jre                                   install
>>> >
>>> >root@pl446:/# java -version
>>> >java version "1.6.0_26"
>>> >Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>>> >Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>>> >
>>> >root@pl446:/# cat /etc/issue
>>> >Debian GNU/Linux 6.0 \n \l
>>> >
>>> >
>>> >
>>> >>
>>> >> -Todd
>>> >>
>>> >>
>>> >> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
>>> >> <darrell.tay...@gmail.com> wrote:
>>> >> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <rajv...@yahoo.com>
>>> >> wrote:
>>> >> >
>>> >> >> The picture is either too small or too pixelated for my eyes :-)
>>> >> >>
>>> >> >
>>> >> > There should be a zoom option in the top right of the page that allows
>>> >> you
>>> >> > to view it full size
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Can you login to the box and send the output of top? If the system is
>>> >> >> unresponsive, it has to be something more than an unbalanced hdfs
>>> >> cluster,
>>> >> >> methinks.
>>> >> >>
>>> >> >
>>> >> > Sorry, I'm unable to login to the box, it's completely unresponsive.
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Raj
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> >________________________________
>>> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >> >> >To: common-user@hadoop.apache.org; Raj Vishwanathan <
>>> rajv...@yahoo.com
>>> >> >
>>> >> >> >Sent: Wednesday, May 9, 2012 2:40 PM
>>> >> >> >Subject: Re: High load on datanode startup
>>> >> >> >
>>> >> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <
>>> rajv...@yahoo.com>
>>> >> >> wrote:
>>> >> >> >
>>> >> >> >> When you say 'load', what do you mean? CPU load or something else?
>>> >> >> >>
>>> >> >> >
>>> >> >> >I mean in the unix sense of load average, i.e. top would show a
>>> load of
>>> >> >> >(currently) 376.
>>> >> >> >
>>> >> >> >Looking at Ganglia stats for the box it's not CPU load as such, the
>>> >> graphs
>>> >> >> >shows actual CPU usage as 30%, but the number of running processes
>>> is
>>> >> >> >simply growing in a linear manner - screen shot of ganglia page
>>> here :
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >>
>>> https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >>
>>> >> >> >> Raj
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> >________________________________
>>> >> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >> >> >> >To: common-user@hadoop.apache.org
>>> >> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
>>> >> >> >> >Subject: High load on datanode startup
>>> >> >> >> >
>>> >> >> >> >Hi,
>>> >> >> >> >
>>> >> >> >> >I wonder if someone could give some pointers with a problem I'm
>>> >> having?
>>> >> >> >> >
>>> >> >> >> >I have a 7 machine cluster setup for testing and we have been
>>> >> pouring
>>> >> >> data
>>> >> >> >> >into it for a week without issue, have learnt several things along
>>> >> the
>>> >> >> way
>>> >> >> >> >and solved all the problems up to now by searching online, but
>>> now
>>> >> I'm
>>> >> >> >> >stuck.  One of the data nodes decided to have a load of 70+ this
>>> >> >> morning,
>>> >> >> >> >stopping datanode and tasktracker brought it back to normal, but
>>> >> every
>>> >> >> >> time
>>> >> >> >> >I start the datanode again the load shoots through the roof, and
>>> >> all I
>>> >> >> get
>>> >> >> >> >in the logs is :
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: Starting DataNode
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG:   host = pl464/10.20.16.64
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG:   args = []
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG:   version = 0.20.2-cdh3u3
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG:   build =
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>> >file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>>> >> >> >> >-************************************************************/
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >2012-05-09 16:12:05,925 INFO
>>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS
>>> Configuration
>>> >> >> >> already
>>> >> >> >> >set up for Hadoop, not re-installing.
>>> >> >> >> >
>>> >> >> >> >2012-05-09 16:12:06,139 INFO
>>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS
>>> Configuration
>>> >> >> >> already
>>> >> >> >> >set up for Hadoop, not re-installing.
>>> >> >> >> >
>>> >> >> >> >Nothing else.
>>> >> >> >> >
>>> >> >> >> >The load seems to max out only 1 of the CPUs, but the machine
>>> >> becomes
>>> >> >> >> >*very* unresponsive
>>> >> >> >> >
>>> >> >> >> >Anybody got any pointers of things I can try?
>>> >> >> >> >
>>> >> >> >> >Thanks
>>> >> >> >> >Darrell.
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Todd Lipcon
>>> >> Software Engineer, Cloudera
>>> >>
>>> >
>>> >
>>> >
>>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera



-- 
Harsh J
