It doesn't look like the $HBASE_HOME/bin/hbase script runs "$HADOOP_HOME/bin/hadoop classpath" directly. Its classpath builder seems to add the $HADOOP_HOME items manually by listing directories, etc. Perhaps this issue could happen if hbase-env.sh sets an HBASE_CLASSPATH that pulls in `hadoop classpath`, and hadoop-env.sh in turn calls `hbase classpath`.
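A minimal sketch of how that kind of mutual call could be guarded against. This is a simulation, not the real scripts: the two shell functions stand in for `hadoop classpath` and `hbase classpath`, the variable name CLASSPATH_GUARD is invented for illustration, and the /opt/... paths are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for `hadoop classpath` and `hbase classpath`.
# Each one asks the other for its entries, mimicking env files that call
# each other. CLASSPATH_GUARD is an invented variable, not a Hadoop or
# HBase setting; the /opt/... paths are placeholders.

hadoop_classpath() {
  if [ -n "${CLASSPATH_GUARD:-}" ]; then
    # Already inside a classpath lookup: answer without calling back.
    echo "/opt/hadoop/lib/*"
    return 0
  fi
  # First entry: set the guard in the child, then ask for the HBase entries.
  echo "/opt/hadoop/lib/*:$(export CLASSPATH_GUARD=1; hbase_classpath)"
}

hbase_classpath() {
  if [ -n "${CLASSPATH_GUARD:-}" ]; then
    echo "/opt/hbase/lib/*"
    return 0
  fi
  echo "/opt/hbase/lib/*:$(export CLASSPATH_GUARD=1; hadoop_classpath)"
}
```

Without the guard check, each function would keep spawning the other in a new subshell, which is the process pile-up described further down the thread. In real env files the same idea would be an exported variable tested before the cross-call.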
I do know that `hbase classpath` can take a very long time and/or hang on network calls if there's a target (build) directory inside $HBASE_HOME, which causes it to use Maven to generate the classpath instead of using a cached file or local generation. Running mvn clean generally clears that up for me whenever it happens on my installs.

On Fri, May 11, 2012 at 3:02 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Fri, May 11, 2012 at 2:29 AM, Darrell Taylor
> <darrell.tay...@gmail.com> wrote:
>>
>> What I saw on the machine was thousands of recursive processes in ps of the
>> form 'bash /usr/bin/hbase classpath...', Stopping everything didn't clean
>> the processes up so had to kill them manually with some grep/xargs foo.
>> Once this was all cleaned up and the hadoop-env.sh file removed the nodes
>> seem to be happy again.
>
> Ah -- maybe the issue is that... my guess is that "hbase classpath" is
> now trying to include the Hadoop dependencies using "hadoop
> classpath". But "hadoop classpath" was recursing right back because of
> that setting in hadoop-env. Basically you made a fork bomb - that
> explains the shape of the graph in Ganglia perfectly.
>
> -Todd
>
>>
>> Darrell.
>>
>>>
>>> Raj
>>>
>>> >________________________________
>>> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >To: common-user@hadoop.apache.org
>>> >Cc: Raj Vishwanathan <rajv...@yahoo.com>
>>> >Sent: Thursday, May 10, 2012 3:57 AM
>>> >Subject: Re: High load on datanode startup
>>> >
>>> >On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>> >
>>> >> That's real weird..
>>> >>
>>> >> If you can reproduce this after a reboot, I'd recommend letting the DN
>>> >> run for a minute, and then capturing a "jstack <pid of dn>" as well as
>>> >> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
>>> >
>>> >What I did after the reboot this morning was to move my dn, nn, and
>>> >mapred directories out of the way, create a new one, formatted it, and
>>> >restarted the node, it's now happy.
>>> >
>>> >I'll try moving the directories back later and do the jstack as you suggest.
>>> >
>>> >>
>>> >> What JVM/JDK are you using? What OS version?
>>> >>
>>> >
>>> >root@pl446:/# dpkg --get-selections | grep java
>>> >java-common             install
>>> >libjaxp1.3-java         install
>>> >libjaxp1.3-java-gcj     install
>>> >libmysql-java           install
>>> >libxerces2-java         install
>>> >libxerces2-java-gcj     install
>>> >sun-java6-bin           install
>>> >sun-java6-javadb        install
>>> >sun-java6-jdk           install
>>> >sun-java6-jre           install
>>> >
>>> >root@pl446:/# java -version
>>> >java version "1.6.0_26"
>>> >Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>>> >Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>>> >
>>> >root@pl446:/# cat /etc/issue
>>> >Debian GNU/Linux 6.0 \n \l
>>> >
>>> >>
>>> >> -Todd
>>> >>
>>> >> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
>>> >> <darrell.tay...@gmail.com> wrote:
>>> >> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>>> >> >
>>> >> >> The picture either too small or too pixelated for my eyes :-)
>>> >> >>
>>> >> >
>>> >> > There should be a zoom option in the top right of the page that allows you
>>> >> > to view it full size
>>> >> >
>>> >> >>
>>> >> >> Can you login to the box and send the output of top? If the system is
>>> >> >> unresponsive, it has to be something more than an unbalanced hdfs cluster,
>>> >> >> methinks.
>>> >> >>
>>> >> >
>>> >> > Sorry, I'm unable to login to the box, it's completely unresponsive.
>>> >> >
>>> >> >>
>>> >> >> Raj
>>> >> >>
>>> >> >> >________________________________
>>> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >> >> >To: common-user@hadoop.apache.org; Raj Vishwanathan <rajv...@yahoo.com>
>>> >> >> >Sent: Wednesday, May 9, 2012 2:40 PM
>>> >> >> >Subject: Re: High load on datanode startup
>>> >> >> >
>>> >> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>>> >> >> >
>>> >> >> >> When you say 'load', what do you mean? CPU load or something else?
>>> >> >> >>
>>> >> >> >
>>> >> >> >I mean in the unix sense of load average, i.e. top would show a load of
>>> >> >> >(currently) 376.
>>> >> >> >
>>> >> >> >Looking at Ganglia stats for the box it's not CPU load as such, the graphs
>>> >> >> >show actual CPU usage as 30%, but the number of running processes is
>>> >> >> >simply growing in a linear manner - screen shot of ganglia page here :
>>> >> >> >
>>> >> >> >https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
>>> >> >> >
>>> >> >> >>
>>> >> >> >> Raj
>>> >> >> >>
>>> >> >> >> >________________________________
>>> >> >> >> > From: Darrell Taylor <darrell.tay...@gmail.com>
>>> >> >> >> >To: common-user@hadoop.apache.org
>>> >> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
>>> >> >> >> >Subject: High load on datanode startup
>>> >> >> >> >
>>> >> >> >> >Hi,
>>> >> >> >> >
>>> >> >> >> >I wonder if someone could give some pointers with a problem I'm having?
>>> >> >> >> >
>>> >> >> >> >I have a 7 machine cluster setup for testing and we have been pouring data
>>> >> >> >> >into it for a week without issue, have learnt several things along the way
>>> >> >> >> >and solved all the problems up to now by searching online, but now I'm
>>> >> >> >> >stuck. One of the data nodes decided to have a load of 70+ this morning,
>>> >> >> >> >stopping datanode and tasktracker brought it back to normal, but every time
>>> >> >> >> >I start the datanode again the load shoots through the roof, and all I get
>>> >> >> >> >in the logs is :
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: Starting DataNode
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: host = pl464/10.20.16.64
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: args = []
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: version = 0.20.2-cdh3u3
>>> >> >> >> >
>>> >> >> >> >STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>>> >> >> >> >-************************************************************/
>>> >> >> >> >
>>> >> >> >> >2012-05-09 16:12:05,925 INFO
>>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
>>> >> >> >> >set up for Hadoop, not re-installing.
>>> >> >> >> >
>>> >> >> >> >2012-05-09 16:12:06,139 INFO
>>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
>>> >> >> >> >set up for Hadoop, not re-installing.
>>> >> >> >> >
>>> >> >> >> >Nothing else.
>>> >> >> >> >
>>> >> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>>> >> >> >> >*very* unresponsive
>>> >> >> >> >
>>> >> >> >> >Anybody got any pointers of things I can try?
>>> >> >> >> >
>>> >> >> >> >Thanks
>>> >> >> >> >Darrell.
>>> >> >> >> >
>>> >>
>>> >> --
>>> >> Todd Lipcon
>>> >> Software Engineer, Cloudera
>>> >>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Harsh J