Thanks for helping. I do have a secondary namenode.
And I'm starting one datanode at a time now. Previously I was using start-dfs.sh to start NN and slaves. I observed pretty strange behaviors: 1. If use start-dfs.sh, almost none of the DN joined. All seemed stuck. 2. If I start one DN at a time, SOME joined instantly, but some looked hanging... The funny thing is: if I start over again, the behavior changed for each node.... Any idea? Thanks, James On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <dar...@darose.net> wrote: > A couple of thoughts: > > 1) It shouldn't take more than a couple of minutes for the data nodes to > re-register after the name node comes back up. If the data nodes aren't > registering, you might try restarting the data node daemons on each machine. > They should all register soon after. (If not, then you have other problems > involved.) > > 2) Re: the original symptom you reported about the namenode "hanging" at > startup: you should read up on the "secondary namenode" component in Hadoop > (you might start at > http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode > and > http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/). > If you don't have a secondary namenode running in your cluster, that can > have a significantly negative impact on namenode startup time. (The > namenode needs to replay the log at startup - which can take a long time if > the log has gotten large.) > > HTH, > > DR > > > On 07/02/2012 03:32 PM, Jianhui Zhang wrote: >> >> Hi Harsh, >> >> The web portal of the NN shows 0 nodes. >> >> Looking into each node's log, all nodes but one have been staying at: >> >> /************************************************************ >> STARTUP_MSG: Starting DataNode >> STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 >> STARTUP_MSG: args = [] >> STARTUP_MSG: version = 0.20.205.0 >> STARTUP_MSG: build = >> >> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 >> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >> ************************************************************/ >> 2012-07-02 11:00:38,891 INFO >> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >> hadoop-metrics2.properties >> 2012-07-02 11:00:38,906 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> MetricsSystem,sub=Stats registered. >> 2012-07-02 11:00:38,908 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >> period at 10 second(s). >> 2012-07-02 11:00:38,908 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics >> system started >> 2012-07-02 11:00:39,091 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> ugi registered. >> 2012-07-02 11:00:39,094 WARN >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >> already exists! >> ============================= >> >> The one node passed that and reached: >> >> 2012-07-02 12:24:58,450 INFO >> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification >> succeeded for blk_-2932980158620691384_129517 >> >> Does that mean progress? It's been 1.5 hour since the start. And the >> file system side is: 74515 files and directories, 28439 blocks = >> 102954 total >> >> Thanks for helping. >> James >> >> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote: >>> >>> Juanhui, >>> >>> It is merely waiting for the DNs to start, and to report its blocks >>> in. This does not take long once the DNs are up and running. Do you >>> see any Live Nodes yet? >>> >>> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com> >>> wrote: >>>> >>>> Hi folks, >>>> >>>> Thanks for helping, especially at such earlier hours. >>>> >>>> After leaving it overnight, during which period nothing happened in >>>> the log, I restarted this morning. This time, it passed the previously >>>> stuck point, and reached all the way to "IPC Server handler.. >>>> starting", in Safe Mode. So it looks more promising now. >>>> >>>> But it's in a state of: >>>> >>>> "The ratio of reported blocks 0.0000 has not reached the threshold >>>> 0.9990. Safe mode will be turned off automatically." >>>> >>>> Does that mean the NN is waiting for DNs's communications/updates? How >>>> can I tell whether it's stuck or just slow? >>>> >>>> The NN log is at: http://pastebin.com/5fvRfRSD >>>> >>>> The jstack output is at: http://pastebin.com/RnDXWrtc >>>> >>>> The configurations are really basic: >>>> >>>> core-site.xml: >>>> >>>> <configuration> >>>> <property> >>>> <name>fs.default.name</name> >>>> <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value> >>>> <final>true</final> >>>> </property> >>>> <property> >>>> <name>io.file.buffer.size</name> >>>> <value>65536</value> >>>> </property> >>>> </configuration> >>>> >>>> It's the same for all nodes. >>>> >>>> Again, appreciate your help. >>>> >>>> Thanks, >>>> James >>>> >>>> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote: >>>>> >>>>> Jianhui, >>>>> >>>>> Can you pastebin.com the output of your "jstack <NN PID>" command >>>>> after its hung, and pass us the paste link please? It looks to me like >>>>> it may have just been merging/saving the image, and that may be slow >>>>> but it depends on how long did you have to wait around to see NN >>>>> resume and begin properly. >>>>> >>>>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com> >>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> Apache Hadoop 0.20.205. >>>>>> >>>>>> I'm trying to restart NN and it always hangs at the very beginning. >>>>>> The only logs I've got are: >>>>>> >>>>>> /************************************************************ >>>>>> STARTUP_MSG: Starting NameNode >>>>>> STARTUP_MSG: host = host/ip >>>>>> STARTUP_MSG: args = [] >>>>>> STARTUP_MSG: version = 0.20.205.0 >>>>>> STARTUP_MSG: build = >>>>>> >>>>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 >>>>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >>>>>> ************************************************************/ >>>>>> 2012-07-02 01:33:01,281 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >>>>>> hadoop-metrics2.properties >>>>>> 2012-07-02 01:33:01,290 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>>>> MetricsSystem,sub=Stats registered. >>>>>> 2012-07-02 01:33:01,292 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >>>>>> period at 10 second(s). >>>>>> 2012-07-02 01:33:01,292 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics >>>>>> system started >>>>>> 2012-07-02 01:33:01,434 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>>>> ugi registered. >>>>>> 2012-07-02 01:33:01,436 WARN >>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >>>>>> already exists! >>>>>> 2012-07-02 01:33:01,441 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>>>> jvm registered. >>>>>> 2012-07-02 01:33:01,441 INFO >>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>>>> NameNode registered. >>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type >>>>>> = 64-bit >>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max >>>>>> memory = 314.0275 MB >>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>>>>> capacity = 2^25 = 33554432 entries >>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>>>>> recommended=33554432, actual=33554432 >>>>>> 2012-07-02 01:33:01,546 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner >>>>>> 2012-07-02 01:33:01,546 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>>>> supergroup=supergroup >>>>>> 2012-07-02 01:33:01,546 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>>>> isPermissionEnabled=true >>>>>> 2012-07-02 01:33:01,550 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>>>> dfs.block.invalidate.limit=100 >>>>>> 2012-07-02 01:33:01,550 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>>>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), >>>>>> accessTokenLifetime=0 min(s) >>>>>> 2012-07-02 01:33:01,787 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered >>>>>> FSNamesystemStateMBean and NameNodeMXBean >>>>>> 2012-07-02 01:33:01,802 INFO >>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names >>>>>> occuring more than 10 times >>>>>> 2012-07-02 01:33:01,811 INFO >>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032 >>>>>> 2012-07-02 01:33:02,406 INFO >>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under >>>>>> construction = 0 >>>>>> 2012-07-02 01:33:02,406 INFO >>>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size >>>>>> 2553316 loaded in 0 seconds. >>>>>> 2012-07-02 01:33:02,410 INFO >>>>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file >>>>>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 >>>>>> seconds. >>>>>> >>>>>> ==================================== >>>>>> >>>>>> It hangs thereafter.... I wonder if anybody has seen this before? >>>>>> >>>>>> Some background: I shut down DFS and MR while there were still jobs >>>>>> running. Some MR jobs were hanging, so I manually killed the children >>>>>> JVMs after the shutdown. Not sure how such actions would affect NN >>>>>> startup. >>>>>> >>>>>> Any help would be appreciated. >>>>>> >>>>>> Thanks, >>>>>> James >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Harsh J >>> >>> >>> >>> >>> -- >>> Harsh J > > >