[ 
https://issues.apache.org/jira/browse/AMBARI-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494155#comment-14494155
 ] 

Hadoop QA commented on AMBARI-10464:
------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12725228/AMBARI-10464.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The test build failed in ambari-server 

Test results: 
https://builds.apache.org/job/Ambari-trunk-test-patch/2328//testReport/
Console output: 
https://builds.apache.org/job/Ambari-trunk-test-patch/2328//console

This message is automatically generated.

> Ambari Agent holding socket open on 50070 prevents NN from starting
> -------------------------------------------------------------------
>
>                 Key: AMBARI-10464
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10464
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: AMBARI-10464.patch
>
>
> The Ambari Agent process appears to be listening on port 50070 and holding it 
> open. This is causing the NN to fail to start until the Ambari Agent is 
> restarted. A netstat -natp reveals that the agent process has this port open.
> {noformat}
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
> {noformat}
> After digging some more through sockets and Linux, I think it's entirely 
> possible that the agent could be assigned a source port that matches the 
> destination port. Anything in the ephemeral port range is up for grabs. 
> Essentially what is happening here is that NN is down and when the agent 
> tries to check it via a socket connection to 50070, the source (client) side 
> of the socket connection binds to 50070 since it's open and within the range 
> specified by {{/proc/sys/net/ipv4/ip_local_port_range}}.
> The client essentially connects to itself; the WEB alert connection timeout 
> is set to 10 seconds. That means that after 10 seconds, it will release the 
> connection automatically. The METRIC alerts, however, use a slightly 
> different mechanism of opening the socket and don't specify the socket 
> timeout. For a METRIC alert, when both the source and destination ports are 
> the same, it will connect and hold that connection for as long as 
> {{socket._GLOBAL_DEFAULT_TIMEOUT}} allows, which could be a very long time.
> - I believe that we need to change the METRIC alert to pass in a timeout 
> value to the socket (between 5 and 10 seconds, just like WEB alerts).
> - Since the Hadoop components seem to use ephemeral ports that the OS says 
> are free game to any client, this will still end up being a problem. The 
> above proposed fix will make it so that the agent will release the socket 
> after a while, preventing the need to restart the agent after fixing the 
> problem. But it's still possible that the agent could bind to that port when 
> making its check.
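
The description above could be sketched roughly as follows. This is only an illustration and is not taken from the attached patch or from the Ambari agent code: the function names (parse_port_range, in_ephemeral_range, is_listening) and the 5 second default are assumptions made for the example. It shows how to tell whether a port falls inside the range read from /proc/sys/net/ipv4/ip_local_port_range, how to pass an explicit timeout when opening the socket, and how a check might recognize that it has connected to itself because the local and remote endpoints are identical.

```python
import socket


def parse_port_range(text):
    """Parse the two whitespace-separated integers stored in
    /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def in_ephemeral_range(port, port_range):
    """Return True if the OS may hand out `port` as a client source port."""
    low, high = port_range
    return low <= port <= high


def is_listening(host, port, timeout=5.0):
    """Return True only if something other than this socket accepted.

    Hypothetical helper: the explicit timeout bounds how long the
    connection attempt can block, and the socket is always closed.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Refused, unreachable, or timed out.
        return False
    try:
        # In a TCP self-connect the local and remote endpoints are equal,
        # as in the netstat output quoted in the description.
        return sock.getsockname() != sock.getpeername()
    finally:
        sock.close()
```

On a Linux host the range can be read with open("/proc/sys/net/ipv4/ip_local_port_range").read() and passed to parse_port_range. Closing the socket in the finally block means that even a self-connection is released as soon as the check returns, rather than being held until the agent restarts.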



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
