[ https://issues.apache.org/jira/browse/HDFS-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435937#comment-13435937 ]
Vinay commented on HDFS-3618:
-----------------------------

* Regarding the StreamPumper thread hanging:
I tried to reproduce the issue by pausing at a debug point inside {{SshFenceByTcpPort.execCommand(Session, String)}}, just before {{errPumper.start();}} starts the thread that reads the streams. By the time I released the debug point, {{exec}} was closed, and {{errPumper}} hung.
*# We can start the stream-reading threads only if {{exec}} is not closed.
*# We can wait until {{exec}} is closed before getting the exit status.
* Regarding the case where the nc command is not found, I think we can handle it as follows:
*# If the command is present and a process is running on the specified port, the nc exit code will be 0.
*# If the command is present and no process is running on that port, the nc exit code will be 1.
*# If the command itself is not present, the exit code will be 127.
*# If the command does not have execute permission, the exit code will be 126 (as Uma mentioned).

Here we need to treat only return code 1 as success; all others should be treated as failure.
We could also introduce a configuration option to specify an alternative {{nc}} command (netcat) in case it is not present on the machine.

> SSH fencing option may incorrectly succeed if nc (netcat) command not present
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-3618
>                 URL: https://issues.apache.org/jira/browse/HDFS-3618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.1.0-alpha
>            Reporter: Brahma Reddy Battula
>         Attachments: zkfc_threaddump.out, zkfc.txt
>
>
> Started the NNs and ZKFCs on Suse11.
> Suse11 has netcat installed, and netcat -z works (but nc -z won't work).
> While executing the following command, we got "command not found", so rc is nonzero and the server is assumed to be down. We end up without checking whether the service is actually down.
> {code}
>       LOG.info(
>           "Indeterminate response from trying to kill service. " +
>           "Verifying whether it is running using nc...");
>       rc = execCommand(session, "nc -z " + serviceAddr.getHostName() +
>           " " + serviceAddr.getPort());
>       if (rc == 0) {
>         // the service is still listening - we are unable to fence
>         LOG.warn("Unable to fence - it is running but we cannot kill it");
>         return false;
>       } else {
>         LOG.info("Verified that the service is down.");
>         return true;
>       }
> {code}
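The exit-code handling proposed in the comment can be sketched roughly as below. This is only an illustration, not the actual patch: the class {{NcFenceCheck}}, the enum {{Outcome}}, and both method names are hypothetical. The meanings of 0 and 1 are taken from the comment and may differ between nc implementations; 126 and 127 are the usual shell statuses for "not executable" and "command not found".

```java
/**
 * Hypothetical helper (not part of Hadoop) that interprets the exit status
 * of "nc -z host port" the way the comment proposes.
 */
final class NcFenceCheck {

    /** What the nc probe tells us about the remote service. */
    enum Outcome { SERVICE_DOWN, SERVICE_RUNNING, INDETERMINATE }

    private NcFenceCheck() {
    }

    /** Maps an nc exit status to an outcome. */
    static Outcome classify(int rc) {
        switch (rc) {
            case 0:
                // nc connected: something is still listening on the port.
                return Outcome.SERVICE_RUNNING;
            case 1:
                // nc ran but could not connect: nothing is listening.
                return Outcome.SERVICE_DOWN;
            default:
                // 126 (not executable), 127 (command not found), or any
                // other status: the probe itself failed, so we learned nothing.
                return Outcome.INDETERMINATE;
        }
    }

    /** Fencing counts as verified only when nc positively reports "down". */
    static boolean fenceVerified(int rc) {
        return classify(rc) == Outcome.SERVICE_DOWN;
    }

    public static void main(String[] args) {
        for (int rc : new int[] {0, 1, 126, 127}) {
            System.out.println("rc=" + rc + " -> " + classify(rc)
                + ", fenceVerified=" + fenceVerified(rc));
        }
    }
}
```

With this mapping, a missing or non-executable nc no longer falls into the "service is down" branch, which is the incorrect success described in the issue.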