Re: Namenode hangs on startup

Jianhui Zhang Mon, 02 Jul 2012 15:30:48 -0700

Thanks for helping.

I do have a secondary namenode.


And I'm starting one datanode at a time now. Previously I was using
start-dfs.sh to start NN and slaves. I observed pretty strange
behaviors:

1. If use start-dfs.sh, almost none of the DN joined. All seemed stuck.

2. If I start one DN at a time, SOME joined instantly, but some looked
hanging... The funny thing is: if I start over again, the behavior
changed for each node....

Any idea?

Thanks,
James


On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <dar...@darose.net> wrote:
> A couple of thoughts:
>
> 1) It shouldn't take more than a couple of minutes for the data nodes to
> re-register after the name node comes back up.  If the data nodes aren't
> registering, you might try restarting the data node daemons on each machine.
> They should all register soon after.  (If not, then you have other problems
> involved.)
>
> 2) Re: the original symptom you reported about the namenode "hanging" at
> startup:  you should read up on the "secondary namenode" component in Hadoop
> (you might start at
> http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode
> and
> http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/).
> If you don't have a secondary namenode running in your cluster, that can
> have a significantly negative impact on namenode startup time.  (The
> namenode needs to replay the log at startup - which can take a long time if
> the log has gotten large.)
>
> HTH,
>
> DR
>
>
> On 07/02/2012 03:32 PM, Jianhui Zhang wrote:
>>
>> Hi Harsh,
>>
>> The web portal of the NN shows 0 nodes.
>>
>> Looking into each node's log, all nodes but one have been staying at:
>>
>> /************************************************************
>> STARTUP_MSG: Starting DataNode
>> STARTUP_MSG:   host = pipeline09.x.y.z/10.2.20.109
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.20.205.0
>> STARTUP_MSG:   build =
>>
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>> -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
>> ************************************************************/
>> 2012-07-02 11:00:38,891 INFO
>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>> hadoop-metrics2.properties
>> 2012-07-02 11:00:38,906 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> MetricsSystem,sub=Stats registered.
>> 2012-07-02 11:00:38,908 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>> period at 10 second(s).
>> 2012-07-02 11:00:38,908 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
>> system started
>> 2012-07-02 11:00:39,091 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> ugi registered.
>> 2012-07-02 11:00:39,094 WARN
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>> already exists!
>> =============================
>>
>> The one node passed that and reached:
>>
>> 2012-07-02 12:24:58,450 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
>> succeeded for blk_-2932980158620691384_129517
>>
>> Does that mean progress? It's been 1.5 hour since the start. And the
>> file system side is: 74515 files and directories, 28439 blocks =
>> 102954 total
>>
>> Thanks for  helping.
>> James
>>
>> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>> Juanhui,
>>>
>>> It is merely waiting for the DNs to start, and to report its blocks
>>> in. This does not take long once the DNs are up and running. Do you
>>> see any Live Nodes yet?
>>>
>>> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> Thanks for helping, especially at such earlier hours.
>>>>
>>>> After leaving it overnight, during which period nothing happened in
>>>> the log, I restarted this morning. This time, it passed the previously
>>>> stuck point, and reached all the way to "IPC Server handler..
>>>> starting", in Safe Mode. So it looks more promising now.
>>>>
>>>> But it's in a state of:
>>>>
>>>> "The ratio of reported blocks 0.0000 has not reached the threshold
>>>> 0.9990. Safe mode will be turned off automatically."
>>>>
>>>> Does that mean the NN is waiting for DNs's communications/updates? How
>>>> can I tell whether it's stuck or just slow?
>>>>
>>>> The NN log is at: http://pastebin.com/5fvRfRSD
>>>>
>>>> The jstack output is at: http://pastebin.com/RnDXWrtc
>>>>
>>>> The configurations are really basic:
>>>>
>>>> core-site.xml:
>>>>
>>>> <configuration>
>>>>    <property>
>>>>      <name>fs.default.name</name>
>>>>      <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
>>>>      <final>true</final>
>>>>    </property>
>>>>    <property>
>>>>      <name>io.file.buffer.size</name>
>>>>      <value>65536</value>
>>>>    </property>
>>>> </configuration>
>>>>
>>>> It's the same for all nodes.
>>>>
>>>> Again, appreciate your help.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>>
>>>>> Jianhui,
>>>>>
>>>>> Can you pastebin.com the output of your "jstack <NN PID>" command
>>>>> after its hung, and pass us the paste link please? It looks to me like
>>>>> it may have just been merging/saving the image, and that may be slow
>>>>> but it depends on how long did you have to wait around to see NN
>>>>> resume and begin properly.
>>>>>
>>>>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Apache Hadoop 0.20.205.
>>>>>>
>>>>>> I'm trying to restart NN and it always hangs at the very beginning.
>>>>>> The only logs I've got are:
>>>>>>
>>>>>> /************************************************************
>>>>>> STARTUP_MSG: Starting NameNode
>>>>>> STARTUP_MSG:   host = host/ip
>>>>>> STARTUP_MSG:   args = []
>>>>>> STARTUP_MSG:   version = 0.20.205.0
>>>>>> STARTUP_MSG:   build =
>>>>>>
>>>>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>>>>>> -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
>>>>>> ************************************************************/
>>>>>> 2012-07-02 01:33:01,281 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>>>>> hadoop-metrics2.properties
>>>>>> 2012-07-02 01:33:01,290 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>> MetricsSystem,sub=Stats registered.
>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>>>>> period at 10 second(s).
>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
>>>>>> system started
>>>>>> 2012-07-02 01:33:01,434 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>> ugi registered.
>>>>>> 2012-07-02 01:33:01,436 WARN
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>>>>> already exists!
>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>> jvm registered.
>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>> NameNode registered.
>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type
>>>>>>        = 64-bit
>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
>>>>>> memory = 314.0275 MB
>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>> capacity      = 2^25 = 33554432 entries
>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>> recommended=33554432, actual=33554432
>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>> supergroup=supergroup
>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>> isPermissionEnabled=true
>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>> dfs.block.invalidate.limit=100
>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
>>>>>> accessTokenLifetime=0 min(s)
>>>>>> 2012-07-02 01:33:01,787 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>>>>>> FSNamesystemStateMBean and NameNodeMXBean
>>>>>> 2012-07-02 01:33:01,802 INFO
>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
>>>>>> occuring more than 10 times
>>>>>> 2012-07-02 01:33:01,811 INFO
>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under
>>>>>> construction = 0
>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
>>>>>> 2553316 loaded in 0 seconds.
>>>>>> 2012-07-02 01:33:02,410 INFO
>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file
>>>>>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0
>>>>>> seconds.
>>>>>>
>>>>>> ====================================
>>>>>>
>>>>>> It hangs thereafter.... I wonder if anybody has seen this before?
>>>>>>
>>>>>> Some background: I shut down DFS and MR while there were still jobs
>>>>>> running. Some MR jobs were hanging, so I manually killed the children
>>>>>> JVMs after the shutdown. Not sure how such actions would affect NN
>>>>>> startup.
>>>>>>
>>>>>> Any help would be appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> James
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>

Re: Namenode hangs on startup

Reply via email to