[
https://issues.apache.org/jira/browse/HDFS-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-3081:
------------------------------
Attachment: hdfs-3081.txt
The attached patch fixes the problem.
I am still using nc to verify that the service is down, because fuser can only
find the listening process when it runs as root or as the user that owns that
process. If the SSH user is neither, fuser will not see the process.
I tested locally by using my external hostname and verifying the following in
the logs:
12/03/19 21:40:19 INFO ha.SshFenceByTcpPort: Connected to todd-w510
12/03/19 21:40:19 INFO ha.SshFenceByTcpPort: Looking for process running on port 8020
12/03/19 21:40:19 DEBUG ha.SshFenceByTcpPort: Running cmd: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020
12/03/19 21:40:19 INFO ha.SshFenceByTcpPort: Indeterminate response from trying to kill service. Verifying whether it is running using nc...
12/03/19 21:40:19 DEBUG ha.SshFenceByTcpPort: Running cmd: nc -z todd-w510 8020
12/03/19 21:40:19 INFO ha.SshFenceByTcpPort: Verified that the service is down.
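For readers following the log above, here is a minimal Java sketch of the flow it shows: try fuser first, and on an indeterminate result probe the target host (not localhost) with nc. This is an illustration, not the code in the attached patch. The class FenceCheckSketch, its methods, the Outcome enum, and the remoteExec callback are hypothetical names, and the exit-code handling is an assumption (fuser exiting 0 when it found and signalled a process, nc -z exiting 0 when the connection succeeded).

```java
import java.util.function.ToIntFunction;

/** Hypothetical sketch; names are illustrative and not taken from the patch. */
final class FenceCheckSketch {
    enum Outcome { KILLED, VERIFIED_DOWN, STILL_LISTENING }

    // Same fuser invocation as in the log: kill whatever holds the TCP port.
    static String fuserCommand(int port) {
        return "PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp " + port;
    }

    // Probe the host the service is configured on, never a hard-coded localhost.
    static String ncCommand(String host, int port) {
        return "nc -z " + host + " " + port;
    }

    // remoteExec stands in for "run this command over SSH and return its exit code".
    static Outcome fence(String host, int port, ToIntFunction<String> remoteExec) {
        int rc = remoteExec.applyAsInt(fuserCommand(port));
        if (rc == 0) {
            // Assumption: exit code 0 means fuser found and signalled a process.
            return Outcome.KILLED;
        }
        // Indeterminate: fuser may have found nothing because the SSH user is
        // neither root nor the owner of the process. Fall back to nc.
        int ncRc = remoteExec.applyAsInt(ncCommand(host, port));
        // Assumption: nc -z exits 0 only if something accepted the connection.
        return ncRc == 0 ? Outcome.STILL_LISTENING : Outcome.VERIFIED_DOWN;
    }
}
```

With host "todd-w510" and port 8020, the second command built here matches the "nc -z todd-w510 8020" line in the log. Passing the host in explicitly is the point of the sketch: the probe has to go to the address the NN is actually bound to.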
> SshFenceByTcpPort uses netcat incorrectly
> -----------------------------------------
>
> Key: HDFS-3081
> URL: https://issues.apache.org/jira/browse/HDFS-3081
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 0.24.0
> Reporter: Philip Zeyliger
> Assignee: Todd Lipcon
> Attachments: hdfs-3081.txt
>
>
> SshFenceByTcpPort currently assumes that the NN is listening on localhost.
> Typical setups have the namenode listening only on the namenode's own
> hostname, so "nc -z localhost" would not detect it.
> Here's an example in which the NN is running, listening on 8020, but doesn't
> respond to "localhost 8020".
> {noformat}
> [root@xxx ~]# lsof -P -p 5286 | grep -i listen
> java 5286 root 110u IPv4 1772357 TCP xxx:8020 (LISTEN)
> java 5286 root 121u IPv4 1772397 TCP xxx:50070 (LISTEN)
> [root@xxx ~]# nc -z localhost 8020
> [root@xxx ~]# nc -z xxx 8020
> Connection to xxx 8020 port [tcp/intu-ec-svcdisc] succeeded!
> {noformat}
> Here's the likely offending code:
> {code}
> LOG.info(
> "Indeterminate response from trying to kill service. " +
> "Verifying whether it is running using nc...");
> rc = execCommand(session, "nc -z localhost 8020");
> {code}
> Naively, we could point netcat at the correct hostname (since the NN ought
> to be listening on the hostname it's configured as), or just use fuser.
> Fuser catches ports independently of what IPs they're bound to:
> {noformat}
> [root@xxx ~]# fuser 1234/tcp
> 1234/tcp: 6766 6768
> [root@xxx ~]# jobs
> [1]- Running nc -l localhost 1234 &
> [2]+ Running nc -l rhel56-18.ent.cloudera.com 1234 &
> [root@xxx ~]# sudo lsof -P | grep -i LISTEN | grep -i 1234
> nc 6766 root 3u IPv4 2563626 TCP localhost:1234 (LISTEN)
> nc 6768 root 3u IPv4 2563671 TCP xxx:1234 (LISTEN)
> {noformat}