[ 
https://issues.apache.org/jira/browse/HDFS-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409314#comment-13409314
 ] 

Brahma Reddy Battula commented on HDFS-3618:
--------------------------------------------



I started the cluster, and after some time the SNN was no longer switching to
the Active state.

I then checked the ZKFC logs. The ZKFC hung while reading the error stream of
the command above (i.e. nc -z), so the SNN is not switching to the Active state.


Nothing further is being logged, but the ZKFC is still running. I then took a
thread dump and observed that the "StreamPumper for STDERR" thread is in timed
waiting and never comes out of it.

{noformat}
2012-07-06 11:45:44,698 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_SERVICE_REQUEST sent
2012-07-06 11:45:44,698 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_SERVICE_ACCEPT received
2012-07-06 11:45:44,709 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentications that can continue: publickey,keyboard-interactive,password
2012-07-06 11:45:44,709 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next 
authentication method: publickey
2012-07-06 11:45:46,500 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentication succeeded (publickey).
2012-07-06 11:45:46,501 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected 
to *****-233
2012-07-06 11:45:48,021 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking 
for process running on port 8020
2012-07-06 11:45:49,635 INFO org.apache.hadoop.ha.SshFenceByTcpPort: 
Indeterminate response from trying to kill service. Verifying whether it is 
running using nc...
{noformat}


I suspected a problem with nc -z, but the same thing appears in earlier logs,
where the error stream was read and processing continued by making the SNN
active (this is what I raised as the issue).

{noformat}
2012-07-06 01:24:08,921 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentication succeeded (publickey).
2012-07-06 01:24:08,921 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected 
to HOST-*****-233
2012-07-06 01:24:08,982 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking 
for process running on port 8020
2012-07-06 01:24:09,654 INFO org.apache.hadoop.ha.SshFenceByTcpPort: 
Indeterminate response from trying to kill service. Verifying whether it is 
running using nc...
2012-07-06 01:24:09,771 WARN org.apache.hadoop.ha.SshFenceByTcpPort: nc -z 
HOST-*****-233 8020 via ssh: bash: nc: command not found
2012-07-06 01:24:09,773 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Verified 
that the service is down.
2012-07-06 01:24:09,773 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Disconnecting from HOST-*****-233 port 22
2012-07-06 01:24:09,778 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing 
successful by method org.apache.hadoop.ha.SshFenceByTcpPort(null) ======
2012-07-06 01:24:09,778 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing 
znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is 
the most recent active...
2012-07-06 01:24:09,790 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Caught an exception, leaving main loop due to Socket closed
2012-07-06 01:24:09,835 INFO org.apache.hadoop.ha.ZKFailoverController: Trying 
to make NameNode at HOST-*****-20/*****.20:8020 active...
2012-07-06 01:24:29,959 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at HOST-*****-20/*****.20:8020 to active 
state
{noformat}
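
In the log above, nc was not found, yet the check still concluded that the service is down. By shell convention, the exit status is 127 when a command is not found and 126 when it is found but not executable, so those two codes could be handled separately from an ordinary non-zero result. The following is only a minimal sketch of that idea: NcExitStatus, State and classify are hypothetical names, not part of SshFenceByTcpPort, and treating every other non-zero code as "down" is simply carried over from the existing check.

```java
class NcExitStatus {
    enum State { LISTENING, DOWN, UNKNOWN }

    // Hypothetical helper: maps the exit status of the remote "nc -z" command
    // to what we can actually conclude about the service.
    static State classify(int rc) {
        if (rc == 0) {
            return State.LISTENING;   // nc connected: the service is still up
        }
        if (rc == 126 || rc == 127) {
            // Shell convention: 127 = command not found, 126 = not executable.
            // nc never ran, so nothing was verified.
            return State.UNKNOWN;
        }
        return State.DOWN;            // same assumption as the existing code
    }

    public static void main(String[] args) {
        for (int rc : new int[] {0, 1, 126, 127}) {
            System.out.println("rc=" + rc + " -> " + classify(rc));
        }
    }
}
```

With something like this, fencing would report success only for State.DOWN and would fail for both LISTENING and UNKNOWN.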

 
One more doubt here: why was it initially able to read the error stream and
go ahead, but later hung?

I then went through the following link, which describes the Java behavior:

http://oreilly.com/pub/h/1092

When reading any stream, the Java read methods block when nothing is available
and the stream is left open.
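
This blocking behavior can be reproduced in isolation. The sketch below uses a plain java.io pipe rather than the real SSH channel, so it only illustrates the general Java behavior and is not the actual StreamPumper code. BlockingReadDemo, run and STILL_BLOCKED are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.concurrent.atomic.AtomicInteger;

class BlockingReadDemo {
    // Marker meaning "read() has not returned yet".
    static final int STILL_BLOCKED = Integer.MIN_VALUE;

    // Returns { value seen while the writer is open, value seen after close }.
    static int[] run() throws Exception {
        PipedOutputStream writer = new PipedOutputStream();
        PipedInputStream reader = new PipedInputStream(writer);
        AtomicInteger result = new AtomicInteger(STILL_BLOCKED);

        Thread pumper = new Thread(() -> {
            try {
                result.set(reader.read());   // blocks: no data, stream still open
            } catch (IOException e) {
                result.set(-2);              // unexpected I/O failure
            }
        }, "demo-pumper");
        pumper.setDaemon(true);
        pumper.start();

        pumper.join(500);                    // bounded wait; reader is still stuck
        int whileOpen = result.get();

        writer.close();                      // closing the write side signals EOF
        pumper.join(5000);
        int afterClose = result.get();

        return new int[] { whileOpen, afterClose };
    }

    public static void main(String[] args) throws Exception {
        int[] r = run();
        System.out.println("returned while open: " + (r[0] != STILL_BLOCKED));
        System.out.println("value after close:   " + r[1]);
    }
}
```

The reader only returns once the write side is closed, at which point read() reports end of stream. A bounded join like the one above is one way a caller could avoid waiting forever on a pumper thread.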

Please correct me if I am wrong.











                
> IF RC is other than zero, we are assuming that Service is down (What if NC 
> command itself not found)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3618
>                 URL: https://issues.apache.org/jira/browse/HDFS-3618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>            Reporter: Brahma Reddy Battula
>
> Started NNs and ZKFCs on SUSE 11.
> SUSE 11 has netcat installed, so netcat -z works (but nc -z won't work).
> Executing the following command fails with "command not found", so rc is
> non-zero and the server is assumed to be down. We end up returning without
> actually checking whether the service is down.
> {code}
> LOG.info(
>             "Indeterminate response from trying to kill service. " +
>             "Verifying whether it is running using nc...");
>         rc = execCommand(session, "nc -z " + serviceAddr.getHostName() +
>             " " + serviceAddr.getPort());
>         if (rc == 0) {
>           // the service is still listening - we are unable to fence
>           LOG.warn("Unable to fence - it is running but we cannot kill it");
>           return false;
>         } else {
>           LOG.info("Verified that the service is down.");
>           return true;          
>         }
> {code}


        
