A couple of thoughts:

1) It shouldn't take more than a couple of minutes for the data nodes to re-register after the name node comes back up. If the data nodes aren't registering, you might try restarting the data node daemons on each machine. They should all register soon after. (If not, then you have other problems involved.)

2) Re: the original symptom you reported about the namenode "hanging" at startup: you should read up on the "secondary namenode" component in Hadoop (you might start at http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode and http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/). If you don't have a secondary namenode running in your cluster, that can have a significantly negative impact on namenode startup time. (The namenode needs to replay the log at startup - which can take a long time if the log has gotten large.)

HTH,

DR

On 07/02/2012 03:32 PM, Jianhui Zhang wrote:
Hi Harsh,

The web portal of the NN shows 0 nodes.

Looking into each node's log, all nodes but one have been staying at:

/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pipeline09.x.y.z/10.2.20.109
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
-r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
************************************************************/
2012-07-02 11:00:38,891 INFO
org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
hadoop-metrics2.properties
2012-07-02 11:00:38,906 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-07-02 11:00:38,908 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-07-02 11:00:38,908 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
system started
2012-07-02 11:00:39,091 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
ugi registered.
2012-07-02 11:00:39,094 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
already exists!
=============================

The one node passed that and reached:

2012-07-02 12:24:58,450 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_-2932980158620691384_129517

Does that mean progress? It's been 1.5 hour since the start. And the
file system side is: 74515 files and directories, 28439 blocks =
102954 total

Thanks for  helping.
James

On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote:
Juanhui,

It is merely waiting for the DNs to start, and to report its blocks
in. This does not take long once the DNs are up and running. Do you
see any Live Nodes yet?

On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
Hi folks,

Thanks for helping, especially at such earlier hours.

After leaving it overnight, during which period nothing happened in
the log, I restarted this morning. This time, it passed the previously
stuck point, and reached all the way to "IPC Server handler..
starting", in Safe Mode. So it looks more promising now.

But it's in a state of:

"The ratio of reported blocks 0.0000 has not reached the threshold
0.9990. Safe mode will be turned off automatically."

Does that mean the NN is waiting for DNs's communications/updates? How
can I tell whether it's stuck or just slow?

The NN log is at: http://pastebin.com/5fvRfRSD

The jstack output is at: http://pastebin.com/RnDXWrtc

The configurations are really basic:

core-site.xml:

<configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
     <final>true</final>
   </property>
   <property>
     <name>io.file.buffer.size</name>
     <value>65536</value>
   </property>
</configuration>

It's the same for all nodes.

Again, appreciate your help.

Thanks,
James

On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
Jianhui,

Can you pastebin.com the output of your "jstack <NN PID>" command
after its hung, and pass us the paste link please? It looks to me like
it may have just been merging/saving the image, and that may be slow
but it depends on how long did you have to wait around to see NN
resume and begin properly.

On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
Hi,

Apache Hadoop 0.20.205.

I'm trying to restart NN and it always hangs at the very beginning.
The only logs I've got are:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = host/ip
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
-r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
************************************************************/
2012-07-02 01:33:01,281 INFO
org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
hadoop-metrics2.properties
2012-07-02 01:33:01,290 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-07-02 01:33:01,292 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-07-02 01:33:01,292 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
system started
2012-07-02 01:33:01,434 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
ugi registered.
2012-07-02 01:33:01,436 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
already exists!
2012-07-02 01:33:01,441 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
jvm registered.
2012-07-02 01:33:01,441 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
NameNode registered.
2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type
       = 64-bit
2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
memory = 314.0275 MB
2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
capacity      = 2^25 = 33554432 entries
2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
recommended=33554432, actual=33554432
2012-07-02 01:33:01,546 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
2012-07-02 01:33:01,546 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
supergroup=supergroup
2012-07-02 01:33:01,546 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=true
2012-07-02 01:33:01,550 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
dfs.block.invalidate.limit=100
2012-07-02 01:33:01,550 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
2012-07-02 01:33:01,787 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStateMBean and NameNodeMXBean
2012-07-02 01:33:01,802 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
occuring more than 10 times
2012-07-02 01:33:01,811 INFO
org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
2012-07-02 01:33:02,406 INFO
org.apache.hadoop.hdfs.server.common.Storage: Number of files under
construction = 0
2012-07-02 01:33:02,406 INFO
org.apache.hadoop.hdfs.server.common.Storage: Image file of size
2553316 loaded in 0 seconds.
2012-07-02 01:33:02,410 INFO
org.apache.hadoop.hdfs.server.common.Storage: Edits file
/apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0
seconds.

====================================

It hangs thereafter.... I wonder if anybody has seen this before?

Some background: I shut down DFS and MR while there were still jobs
running. Some MR jobs were hanging, so I manually killed the children
JVMs after the shutdown. Not sure how such actions would affect NN
startup.

Any help would be appreciated.

Thanks,
James



--
Harsh J



--
Harsh J


Reply via email to