[ https://issues.apache.org/jira/browse/HDFS-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409314#comment-13409314 ]
Brahma Reddy Battula commented on HDFS-3618:
--------------------------------------------

I started the cluster, and after some time the SNN was not switching to the Active state. I then checked the zkfc logs: zkfc hung while reading the error stream when executing the above command (i.e. nc -z), and hence the SNN did not switch to the Active state. No further logs are written, but zkfc is still running. I then checked a thread dump and observed that "StreamPumper for STDERR" is in timed waiting and never comes out of it.

{noformat}
2012-07-06 11:45:44,698 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
2012-07-06 11:45:44,698 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
2012-07-06 11:45:44,709 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
2012-07-06 11:45:44,709 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
2012-07-06 11:45:46,500 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2012-07-06 11:45:46,501 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to *****-233
2012-07-06 11:45:48,021 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 8020
2012-07-06 11:45:49,635 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Indeterminate response from trying to kill service. Verifying whether it is running using nc...
{noformat}

I thought the problem was with nc -z, but I observed the same thing in earlier logs, where the error stream was read and processing continued by making the SNN active (this is what I raised as the issue).

{noformat}
2012-07-06 01:24:08,921 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2012-07-06 01:24:08,921 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to HOST-*****-233
2012-07-06 01:24:08,982 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 8020
2012-07-06 01:24:09,654 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Indeterminate response from trying to kill service. Verifying whether it is running using nc...
2012-07-06 01:24:09,771 WARN org.apache.hadoop.ha.SshFenceByTcpPort: nc -z HOST-*****-233 8020 via ssh: bash: nc: command not found
2012-07-06 01:24:09,773 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Verified that the service is down.
2012-07-06 01:24:09,773 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from HOST-*****-233 port 22
2012-07-06 01:24:09,778 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.SshFenceByTcpPort(null) ======
2012-07-06 01:24:09,778 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2012-07-06 01:24:09,790 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving main loop due to Socket closed
2012-07-06 01:24:09,835 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at HOST-*****-20/*****.20:8020 active...
2012-07-06 01:24:29,959 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at HOST-*****-20/*****.20:8020 to active state
{noformat}

One more doubt: initially it was able to read the error stream and go ahead, but later it hung. Why? I then went through the following link and learned about this Java behavior:

http://oreilly.com/pub/h/1092

When reading any stream, the Java read methods block when nothing is available and the stream is left open.

Please correct me if I am wrong.

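The blocking behavior described above can be reproduced and bounded outside Hadoop. The following is a minimal sketch, not the actual StreamPumper code: it moves the read onto a daemon thread and gives up after a timeout instead of waiting forever for end-of-stream. The class name `BoundedStreamRead`, the method `readWithTimeout`, and the timeout values are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Hadoop.
final class BoundedStreamRead {

  /**
   * Reads the stream until end-of-stream. Returns null if end-of-stream is
   * not reached within timeoutMs or if the read fails.
   */
  static String readWithTimeout(final InputStream in, long timeoutMs) {
    // Daemon thread, so a stuck read cannot keep the JVM alive.
    ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-stream-reader");
      t.setDaemon(true);
      return t;
    });
    Future<String> result = pool.submit(() -> {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[1024];
      int n;
      // read() blocks while the stream is open and no data is available.
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    });
    try {
      return result.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      return null; // stream left open with nothing to read
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    } catch (ExecutionException e) {
      return null; // the read itself failed
    } finally {
      pool.shutdownNow(); // interrupts the reader thread if it is still running
    }
  }

  public static void main(String[] args) {
    // A stream that ends: the whole content is returned.
    InputStream ended = new ByteArrayInputStream(
        "bash: nc: command not found".getBytes(StandardCharsets.UTF_8));
    System.out.println(readWithTimeout(ended, 1000));

    // A stream that stays open and never delivers data: the call gives up.
    InputStream neverEnds = new InputStream() {
      @Override
      public int read() throws IOException {
        try {
          Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException e) {
          throw new InterruptedIOException("reader interrupted");
        }
        return -1;
      }
    };
    System.out.println(readWithTimeout(neverEnds, 200));
  }
}
```

Note that interrupting the reader thread is enough for the simulated stream here, but a blocking read on a real socket-backed stream often ignores interrupts. In that case the caller would also have to close the underlying stream or channel to release the thread.
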
> IF RC is other than zero, we are assuming that Service is down (What if NC command itself not found)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3618
>                 URL: https://issues.apache.org/jira/browse/HDFS-3618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>            Reporter: Brahma Reddy Battula
>
> Started the NNs and zkfcs on Suse11.
> Suse11 has netcat installed, and netcat -z works (but nc -z won't work).
> While executing the following command, we got "command not found", so rc is
> non-zero and the server is assumed to be down. Here we end up
> without checking whether the service is actually down or not.
> {code}
> LOG.info(
>     "Indeterminate response from trying to kill service. " +
>     "Verifying whether it is running using nc...");
> rc = execCommand(session, "nc -z " + serviceAddr.getHostName() +
>     " " + serviceAddr.getPort());
> if (rc == 0) {
>   // the service is still listening - we are unable to fence
>   LOG.warn("Unable to fence - it is running but we cannot kill it");
>   return false;
> } else {
>   LOG.info("Verified that the service is down.");
>   return true;
> }
> {code}
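
The check quoted above collapses every non-zero exit code into "service is down". A minimal sketch of a stricter interpretation follows; it is not the committed fix for this issue. It assumes the remote command runs under a POSIX shell, where exit status 127 means the command was not found and 126 means it was found but could not be executed, and it further assumes that the installed `nc` exits with 1 when nothing is listening on the port. The names `NcExitCode`, `PortCheck`, `classify`, and `fenceVerified` are hypothetical.

```java
// Hypothetical helper for illustration; not part of Hadoop.
final class NcExitCode {

  enum PortCheck { LISTENING, NOT_LISTENING, INDETERMINATE }

  /** Maps the exit code of "nc -z host port" to what it actually proves. */
  static PortCheck classify(int rc) {
    switch (rc) {
      case 0:
        return PortCheck.LISTENING;     // port accepted a connection
      case 1:
        return PortCheck.NOT_LISTENING; // assumption: this nc exits 1 on a closed port
      default:
        // Includes 126/127 from the shell: nc never ran, so nothing was verified.
        return PortCheck.INDETERMINATE;
    }
  }

  /** Fencing counts as verified only if nc ran and found the port closed. */
  static boolean fenceVerified(int rc) {
    return classify(rc) == PortCheck.NOT_LISTENING;
  }

  public static void main(String[] args) {
    for (int rc : new int[] {0, 1, 127}) {
      System.out.println(rc + " -> " + classify(rc)
          + ", fence verified: " + fenceVerified(rc));
    }
  }
}
```

With this mapping, the "bash: nc: command not found" case from the logs would be reported as indeterminate rather than as a successful fence, so the caller could fall through to the next configured fencing method.
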