[
https://issues.apache.org/jira/browse/AMBARI-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Hurley resolved AMBARI-10464.
--------------------------------------
Resolution: Fixed
> Ambari Agent holding socket open on 50070 prevents NN from starting
> -------------------------------------------------------------------
>
> Key: AMBARI-10464
> URL: https://issues.apache.org/jira/browse/AMBARI-10464
> Project: Ambari
> Issue Type: Bug
> Components: ambari-agent
> Affects Versions: 2.0.0
> Reporter: Jonathan Hurley
> Assignee: Jonathan Hurley
> Priority: Critical
> Fix For: 2.1.0
>
> Attachments: AMBARI-10464.patch
>
>
> The Ambari Agent process appears to be holding port 50070 open. This
> prevents the NN from starting until the Ambari Agent is restarted.
> A netstat -natp reveals that the agent process has this port open.
> {noformat}
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
> {noformat}
> After digging some more through sockets and Linux, I think it's entirely
> possible for the agent to be assigned a source port that matches the
> destination port. Anything in the ephemeral port range is up for grabs.
> Essentially what is happening here is that the NN is down, and when the
> agent tries to check it via a socket connection to 50070, the source
> (client) side of the socket connection binds to 50070, since that port is
> free and within the range specified by
> {{/proc/sys/net/ipv4/ip_local_port_range}}.
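To check whether a service port falls inside the range the kernel may hand out as a client source port, something like the following Python sketch could be used. The helper names are made up for illustration, and the sample value 32768 61000 mentioned below is only an example; the real range has to be read from the host in question.

```python
def parse_port_range(text):
    """Parse the two integers stored in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def is_ephemeral(port, port_range):
    """Return True if the kernel may assign this port as a client source port."""
    low, high = port_range
    return low <= port <= high


def read_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the range from the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_port_range(handle.read())
```

On a host where the file contains `32768 61000`, `is_ephemeral(50070, read_port_range())` would return True, which is the situation described above.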
> The client essentially connects to itself. The WEB alert connection timeout
> is set to 10 seconds, which means that after 10 seconds it releases the
> connection automatically. The METRIC alerts, however, use a slightly
> different mechanism for opening the socket and don't specify a socket
> timeout. For a METRIC alert, when the source and destination ports are
> the same, it will connect and hold that connection for as long as
> {{socket._GLOBAL_DEFAULT_TIMEOUT}} allows, which could be a very long time.
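For reference, a Python socket created without an explicit timeout has no timeout at all unless a process-wide default was set with socket.setdefaulttimeout(); the _GLOBAL_DEFAULT_TIMEOUT sentinel just means "use that default". The sketch below illustrates the difference using only the standard library. The helper name is hypothetical and this is not the agent's METRIC alert code.

```python
import socket


def effective_timeout(timeout=None):
    """Create an unconnected TCP socket and report the timeout it would use.

    A return value of None means blocking mode with no timeout at all.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if timeout is not None:
            # Explicit per-socket timeout, as the WEB alerts are described as doing.
            sock.settimeout(timeout)
        return sock.gettimeout()
    finally:
        sock.close()
```

With no process-wide default configured, `effective_timeout()` reports None, while `effective_timeout(10)` reports a 10 second limit.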
> - I believe that we need to change the METRIC alert to pass a timeout value
> to the socket (between 5 and 10 seconds, just like WEB alerts).
> - Since the Hadoop components seem to use ephemeral ports that the OS says
> are fair game for any client, this will still end up being a problem. The
> fix proposed above will make the agent release the socket after a while,
> so the agent no longer needs to be restarted to free the port. But it's
> still possible that the agent could bind to that port when making its
> check.
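One way the two points above could be combined is sketched below in Python 3: the probe passes an explicit timeout, and it also closes the socket right away if it turns out to have connected to itself. This is only an illustration under those assumptions, not the attached patch; the function names and the 5 second default are hypothetical.

```python
import socket


def is_self_connection(sock):
    """True when the local and remote endpoints of the socket are identical."""
    return sock.getsockname() == sock.getpeername()


def port_is_serving(host, port, timeout=5.0):
    """Probe host:port with a bounded wait.

    A connection whose source and destination endpoints match is treated as
    "not serving", because no real server accepted it.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out.
        return False
    try:
        return not is_self_connection(sock)
    finally:
        # Always release the port immediately after the check.
        sock.close()
```

This does not stop the kernel from picking 50070 as the source port, but it keeps the window during which the agent holds the port very short.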