[
https://issues.apache.org/jira/browse/HDFS-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251065#comment-14251065
]
hoelog commented on HDFS-7539:
------------------------------
Actually, the NN hangs for 1~2 minutes because of GC.
This problem may not appear when the NN has enough memory.
> Namenode can't leave safemode because of Datanodes' IPC socket timeout
> ----------------------------------------------------------------------
>
> Key: HDFS-7539
> URL: https://issues.apache.org/jira/browse/HDFS-7539
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.5.1
> Environment: 1 master, 1 secondary and 128 slaves; each node has 24
> cores and 48 GB memory. The fsimage is 4 GB.
> Reporter: hoelog
>
> During namenode startup, the datanodes appear to be waiting for the namenode's
> response over IPC to register their block pools.
> Here is the DN's log:
> {code}
> 2014-12-16 20:28:09,064 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Acknowledging ACTIVE Namenode Block pool
> BP-877672386-10.114.130.143-1412666752827 (Datanode Uuid
> 2117395f-e034-4b4a-adec-8a28464f4796) service to NN.x.com/10.x.x143:9000
> {code}
> But the namenode is too busy to respond, and the datanodes hit a socket
> timeout - the default is 1 minute.
> {code}
> 2014-12-16 20:29:09,857 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> IOException in offerService
> java.net.SocketTimeoutException: Call From DN1.x.com/10.x.x.84 to
> NN.x.com:9000 failed on socket timeout exception:
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.x.x.84:57924 remote=NN.x.com/10.x.x.143:9000]; For more details
> see: http://wiki.apache.org/hadoop/SocketTimeout
> {code}
> The same events repeat, and eventually the NN drops most connection attempts
> from the DNs, so the NN can't leave safemode.
> DN's log:
> {code}
> 2014-12-16 20:32:25,895 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> IOException in offerService
> java.io.IOException: Failed on local exception: java.io.IOException:
> Connection reset by peer
> {code}
> There are no troubles in the network, configuration or servers. I think the NN
> is simply too busy to respond to the DNs within a minute.
> I configured "ipc.ping.interval" to 15 minutes in core-site.xml, and that
> was helpful for my cluster.
> {code}
> <property>
>   <name>ipc.ping.interval</name>
>   <value>900000</value>
> </property>
> {code}
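> As far as I understand the IPC client (this is my reading, not something
> verified in this issue), "ipc.ping.interval" is used as the client's socket
> read timeout, and the related "ipc.client.ping" property controls whether the
> client sends a ping after that interval instead of failing the call. A sketch
> of the two knobs together in core-site.xml, with the interpretation hedged in
> the comments:
> {code}
> <!-- Assumption: ipc.client.ping (default true) makes the client ping the
>      server after ipc.ping.interval ms of silence rather than abort the call;
>      with the ping disabled, the interval acts as a hard RPC timeout. -->
> <property>
>   <name>ipc.client.ping</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ipc.ping.interval</name>
>   <value>900000</value> <!-- 15 minutes, as above -->
> </property>
> {code}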
> In my cluster, the namenode took 1~5 minutes to respond to the DNs' requests.
> It would be helpful if there were a more elegant solution (see also the sketch
> after the log below).
> {code}
> 2014-12-16 23:28:16,598 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Acknowledging ACTIVE Namenode Block pool
> BP-877672386-10.x.x.143-1412666752827 (Datanode Uuid
> c4f7beea-b8e9-404f-bc81-6e87e37263d2) service to NN/10.x.x.143:9000
> 2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Sent 1 blockreports 2090961 blocks total. Took 1690 msec to generate and
> 193738 msecs for RPC and NN processing. Got back commands
> org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@20e68e11
> 2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Got finalize command for block pool BP-877672386-10.x.x.143-1412666752827
> 2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: Computing capacity
> for map BlockMap
> 2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: VM type =
> 64-bit
> 2014-12-16 23:31:32,044 INFO org.apache.hadoop.util.GSet: 0.5% max memory 3.6
> GB = 18.2 MB
> 2014-12-16 23:31:32,045 INFO org.apache.hadoop.util.GSet: capacity =
> 2^21 = 2097152 entries
> 2014-12-16 23:31:32,046 INFO
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block
> Verification Scanner initialized with interval 504 hours for block pool
> BP-877672386-10.114.130.143-1412666752827
> {code}
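> One possible mitigation along those lines (my suggestion, not something tried
> in this report) is to stagger the datanodes' initial block reports so that all
> 128 DNs do not hit the namenode at the same moment. A sketch for hdfs-site.xml,
> assuming the stock "dfs.blockreport.initialDelay" property:
> {code}
> <!-- Assumption: dfs.blockreport.initialDelay delays each DN's first block
>      report by a random amount up to this many seconds, spreading the
>      registration and report load on a freshly started NN. -->
> <property>
>   <name>dfs.blockreport.initialDelay</name>
>   <value>600</value>
> </property>
> {code}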
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)