Jonathan Hurley created AMBARI-10464:
----------------------------------------

             Summary: Ambari Agent holding socket open on 50070 prevents NN 
from starting
                 Key: AMBARI-10464
                 URL: https://issues.apache.org/jira/browse/AMBARI-10464
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 2.0.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Critical
             Fix For: 2.1.0


The Ambari Agent process appears to be listening on port 50070 and holding it 
open. This is causing the NN to fail to start until the Ambari Agent is 
restarted. A netstat -natp reveals that the agent process has this port open.

{noformat}
[root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
{noformat}

After digging some more through sockets and Linux, I think it's entirely 
possible that the agent could be assigned a source port that matches the 
destination port. Anything in the ephemeral port range is up for grabs. 
Essentially what is happening here is that NN is down, and when the agent tries 
to check it via a socket connection to 50070, the source (client) side of the 
socket connection binds to 50070, since that port is unused and within the 
range specified by {{/proc/sys/net/ipv4/ip_local_port_range}}.
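
To make the overlap concrete, here is a minimal sketch that checks whether a port falls inside the ephemeral range. The helper names ({{parse_port_range}}, {{is_ephemeral}}) and the sample range are illustrative assumptions, not part of the agent; on a real host the text would be read from {{/proc/sys/net/ipv4/ip_local_port_range}}.

```python
def parse_port_range(text):
    """Parse the two whitespace-separated integers found in ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def is_ephemeral(port, port_range):
    """Return True if the OS may hand out `port` as a client source port."""
    low, high = port_range
    return low <= port <= high


# Sample contents only (an assumption); the real values vary by host and
# should be read from /proc/sys/net/ipv4/ip_local_port_range.
sample = "32768\t61000\n"
print(is_ephemeral(50070, parse_port_range(sample)))
```

With a range like the sample one, 50070 is inside it, so an outgoing connection can be assigned 50070 as its source port.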

The client essentially connects to itself. The WEB alert connection timeout is 
set to 10 seconds, which means that after 10 seconds it will release the 
connection automatically. The METRIC alerts, however, use a slightly different 
mechanism of opening the socket and don't specify the socket timeout. For a 
METRIC alert, when both the source and destination ports are the same, it will 
connect and hold that connection for as long as 
{{socket._GLOBAL_DEFAULT_TIMEOUT}} allows, which could be a very long time.
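
As a rough illustration of why the missing timeout matters: a socket that is not given an explicit timeout inherits the process-wide default, which is {{None}} (block indefinitely) unless {{socket.setdefaulttimeout()}} has been called. The helper {{effective_timeout}} below is hypothetical and only exists to show the difference.

```python
import socket


def effective_timeout(requested=None):
    """Return the timeout a fresh socket ends up with.

    If `requested` is None the socket keeps the process-wide default,
    otherwise the explicit value is applied.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if requested is not None:
            sock.settimeout(requested)
        return sock.gettimeout()
    finally:
        sock.close()


# No explicit timeout: whatever socket.getdefaulttimeout() reports.
print(effective_timeout())
# Explicit timeout, as the WEB alerts are described as using.
print(effective_timeout(10))
```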

- I believe that we need to change METRIC alerts to pass in a timeout value to 
the socket (between 5 and 10 seconds, just like WEB alerts).
- Since the Hadoop components seem to use ephemeral ports that the OS says are 
fair game for any client, this will still end up being a problem. The above 
proposed fix will make it so that the agent releases the socket after a 
while, preventing the need to restart the agent after fixing the problem. But 
it's still possible that the agent could bind to that port when making its 
check.
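
A minimal sketch of what such a check could look like, assuming a plain TCP probe rather than the agent's actual alert code. The function name {{check_port}} and the 5 second default are assumptions. In addition to the timeout, it treats a connection whose local and remote endpoints are identical as a self-connection and reports the service as down, then always closes the socket so the port is released.

```python
import socket


def check_port(host, port, timeout=5.0):
    """Return True only if a separate listener accepted the connection.

    Hypothetical helper: connects with an explicit timeout, rejects
    self-connections, and always closes the socket afterwards.
    """
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: nothing is listening.
        return False
    try:
        # In a self-connection the local endpoint equals the remote one.
        return sock.getsockname() != sock.getpeername()
    finally:
        sock.close()
```

Usage would be along the lines of {{check_port("192.168.1.141", 50070)}}. This does not stop the OS from handing out 50070 as a source port, but it keeps the agent from holding the port for long.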



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)