[ 
https://issues.apache.org/jira/browse/AMBARI-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hurley resolved AMBARI-10464.
--------------------------------------
    Resolution: Fixed

> Ambari Agent holding socket open on 50070 prevents NN from starting
> -------------------------------------------------------------------
>
>                 Key: AMBARI-10464
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10464
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: AMBARI-10464.patch
>
>
> The Ambari Agent process appears to be listening on port 50070 and holding it 
> open. This is causing the NN to fail to start until the Ambari Agent is 
> restarted. Running netstat -anp shows that the agent process has this port open:
> {noformat}
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
> {noformat}
> After digging further into sockets on Linux, I think it's entirely 
> possible for the agent to be assigned a source port that matches the 
> destination port. Anything in the ephemeral port range is up for grabs. 
> Essentially, the NN is down, and when the agent tries to check it via a 
> socket connection to 50070, the source (client) side of the connection 
> binds to 50070, since that port is free and falls within the range 
> specified by {{/proc/sys/net/ipv4/ip_local_port_range}}.
> The client essentially connects to itself. The WEB alert connection timeout 
> is set to 10 seconds, so after 10 seconds it releases the connection 
> automatically. The METRIC alerts, however, use a slightly different 
> mechanism for opening the socket and don't specify a socket timeout. For a 
> METRIC alert, when the source and destination ports are the same, it will 
> connect and hold that connection for as long as 
> {{socket._GLOBAL_DEFAULT_TIMEOUT}} allows, which could be a very long time.
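
To make the port-range point above concrete, here is a minimal Python sketch that checks whether a service port falls inside the range the kernel may hand out as a client source port. The helper names (parse_port_range, is_ephemeral, read_port_range) are hypothetical and are not part of Ambari; the file is only readable on Linux.

```python
def parse_port_range(text):
    """Parse the contents of ip_local_port_range, e.g. '32768\t61000\n'."""
    low, high = text.split()
    return int(low), int(high)


def is_ephemeral(port, port_range):
    """Return True if the port lies inside the (low, high) range, inclusive."""
    low, high = port_range
    return low <= port <= high


def read_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the local (ephemeral) port range from procfs (Linux only)."""
    with open(path) as handle:
        return parse_port_range(handle.read())
```

For example, with a sample range of 32768 to 61000 (an illustrative value, not necessarily what a given host reports), is_ephemeral(50070, (32768, 61000)) is true, which is the situation described above.
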
> - I believe that we need to change METRIC alerts to pass a timeout value to 
> the socket (between 5 and 10 seconds, just like WEB alerts).
> - Since the Hadoop components seem to use ephemeral ports that the OS 
> considers fair game for any client, this will still end up being a problem. 
> The proposed fix above makes the agent release the socket after a while, 
> which avoids having to restart the agent to recover. But it's still 
> possible that the agent could bind to that port when making its check.
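
The two points above can be sketched in Python roughly as follows. This is not the attached patch and not Ambari's actual alert code; check_port and is_self_connected are hypothetical names, and the 5 second default is simply taken from the 5 to 10 second range suggested above. The self-connect test is one possible extra safeguard, shown here only as an assumption about how the remaining case could be detected.

```python
import socket


def is_self_connected(sock):
    """True when the socket's local and remote endpoints are identical."""
    return sock.getsockname() == sock.getpeername()


def check_port(host, port, timeout=5.0):
    """Return True if a peer other than ourselves accepts on host:port.

    The explicit timeout bounds how long the connect attempt can block;
    the socket is always closed before returning, so it is never held.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    try:
        # A self-connection means nothing is really listening on the port.
        return not is_self_connected(sock)
    finally:
        sock.close()
```

Closing the socket in the finally block means that even when the client ends up connected to itself, the port is released as soon as the check finishes rather than staying in the ESTABLISHED state.
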



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
