The long startup time after the restart looks like it was caused by the
SecondaryNameNode not having been able to roll the edits log for some time.
Can you post your NameNode log from around the same time as this
SecondaryNameNode log (2011-07-21 16:00-16:30)?
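
For what it's worth, a quick back-of-the-envelope check on the figures already in your logs shows why the restart took so long. The numbers below are copied straight from the log lines you posted; the script itself is just an illustration:

```python
# Figures taken from the posted NameNode/SecondaryNameNode logs.
edits_bytes = 12751835      # "Edits file ... of size 12751835"
num_edits = 138217          # "edits # 138217"
replay_seconds = 1406       # "loaded in 1406 seconds" (NameNode restart)
image_load_seconds = 166    # fsimage "loaded in 166 seconds"

rate = num_edits / replay_seconds                     # edits replayed per second
total_minutes = (replay_seconds + image_load_seconds) / 60

print(f"~{rate:.0f} edits/s, startup replay ~{total_minutes:.0f} minutes")
# -> ~98 edits/s, startup replay ~26 minutes
```

That ~26 minutes of image load plus edits replay matches the 16:31:54 to 16:58:07 gap in your NameNode log almost exactly. Frequent checkpoints (see fs.checkpoint.period and fs.checkpoint.size, which default to 3600 seconds and 64 MB on this version, if I remember correctly) keep the edits file small, and with it the replay cost at restart.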

-Joey

On Fri, Jul 22, 2011 at 8:29 AM, Rahul Das <rahul.h...@gmail.com> wrote:

> Yes, I have a SecondaryNameNode running. Here are the logs from the
> SecondaryNameNode:
>
> 2011-07-21 16:02:47,908 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /home/hadoop/tmp/dfs/namesecondary/current/edits of size 12751835
> edits # 138217 loaded in 1581 seconds.
> 2011-07-21 16:03:21,925 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 2045516451 saved in 29 seconds.
> 2011-07-21 16:03:24,974 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
> 0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
> 0 Number of syncs: 0 SyncTimes(ms): 0
> 2011-07-21 16:03:25,545 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> xx.xx.xx.xx:50070putimage=1&port=50090&machine=xx.xx.xx.xx&token=-18:1554828842:0:1311242583000:1311240481442
> 2011-07-21 16:29:24,356 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
> doCheckpoint:
> 2011-07-21 16:29:24,358 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
> java.io.IOException: Call to xx.xx.xx.xx:9000 failed on local exception:
> java.io.IOException: Connection reset by peer
>
> Regards,
> Rahul
>
>
> On Fri, Jul 22, 2011 at 5:40 PM, Joey Echeverria <j...@cloudera.com> wrote:
>
>> Do you have an instance of the SecondaryNamenode in your cluster?
>>
>> -Joey
>>
>>
>> On Fri, Jul 22, 2011 at 3:15 AM, Rahul Das <rahul.h...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a Hadoop cluster with 20 data nodes. Yesterday I found that
>>> the NameNode was not responding (no reads or writes to HDFS were happening).
>>> It stayed stuck for a few hours, then I shut down the NameNode and found the
>>> following error in the NameNode log:
>>>
>>> 2011-07-21 16:15:31,500 WARN org.apache.hadoop.ipc.Server: IPC Server
>>> Responder, call
>>> getProtocolVersion(org.apache.hadoop.hdfs.protocol.ClientProtocol, 41) from
>>> xx.xx.xx.xx:13568: output error
>>>
>>> This error appeared for every data node, and the data nodes were not able
>>> to communicate with the NameNode.
>>>
>>> After I restarted the NameNode:
>>>
>>> 2011-07-21 16:31:54,110 INFO
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
>>> 2011-07-21 16:31:54,216 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
>>> Initializing RPC Metrics with hostName=NameNode, port=9000
>>> 2011-07-21 16:31:54,223 INFO
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
>>> xx.xx.xx.xx:9000
>>> 2011-07-21 16:31:54,225 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>> Initializing JVM Metrics with processName=NameNode, sessionId=null
>>> 2011-07-21 16:31:54,226 INFO
>>> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
>>> NameNodeMeterics using context
>>> object:org.apache.hadoop.metrics.spi.NullContext
>>> 2011-07-21 16:31:54,280 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
>>> 2011-07-21 16:31:54,280 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
>>> 2011-07-21 16:31:54,280 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>> isPermissionEnabled=false
>>> 2011-07-21 16:31:54,287 INFO
>>> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
>>> Initializing FSNamesystemMetrics using context
>>> object:org.apache.hadoop.metrics.spi.NullContext
>>> 2011-07-21 16:31:54,289 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>>> FSNamesystemStatusMBean
>>> 2011-07-21 16:31:54,880 INFO
>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 15817482
>>> 2011-07-21 16:34:38,463 INFO
>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under
>>> construction = 82
>>> 2011-07-21 16:34:41,177 INFO
>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
>>> 2042701824 loaded in 166 seconds.
>>> 2011-07-21 16:58:07,624 INFO
>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file
>>> /home/hadoop/current/edits of size 12751835 edits # 138217 loaded in 1406
>>> seconds.
>>>
>>> Then it halted for a long time. After about an hour it started working again.
>>>
>>> My questions are: when does the "IPC Server Responder" error occur, and is
>>> there a way to deal with it?
>>> Also, if my NameNode is busy doing something, how can I find out what it is
>>> doing?
>>>
>>> Regards,
>>> Rahul
>>
>>
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>>
>


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
