Re: Review Request 33166: Ambari Agent holding socket open on 50070 prevents NN from starting

Alejandro Fernandez Tue, 14 Apr 2015 10:23:22 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/#review80059
-----------------------------------------------------------



Where do we specify what ports ambari-agent is allowed to listen on? Trying to 
make Hadoop and ambari-agent use disjoint sets is one option, although 
difficult to enforce, but this is something else we could do to avoid port 
collisions.

- Alejandro Fernandez


On April 14, 2015, 2:09 p.m., Jonathan Hurley wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33166/
> -----------------------------------------------------------
> 
> (Updated April 14, 2015, 2:09 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-10464
>     https://issues.apache.org/jira/browse/AMBARI-10464
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> The Ambari Agent process appears to be listening on port 50070 and holding it 
> open. This is causing the NN to fail to start until the Ambari Agent is 
> restarted. A netstat -natp reveals that the agent process has this port open.
> ```
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
> ```
> 
> After digging some more through sockets and Linux, I think it's entirely 
> possible that the agent could be assigned a source port that matches the 
> destination port. Anything in the ephemeral port range is up for grabs. 
> Essentially what is happening here is that NN is down and when the agent 
> tries to check it via a socket connection to 50070, the source (client) side 
> of the socket connection binds to 50070 since that port is free and within 
> the range specified by /proc/sys/net/ipv4/ip_local_port_range.
> 
> The client essentially connects to itself; the WEB alert connection timeout 
> is set to 10 seconds. That means that after 10 seconds, it will release the 
> connection automatically. The METRIC alerts, however, use a slightly 
> different mechanism of opening the socket and don't specify the socket 
> timeout. For a METRIC alert, when both the source and destination ports are 
> the same, it will connect and hold that connection for as long as 
> socket._GLOBAL_DEFAULT_TIMEOUT, which could be a very long time.
> 
> I believe that we need to change the METRIC alert to pass in a timeout value 
> to the socket (between 5 and 10 seconds, just like WEB alerts).
> 
> Since the Hadoop components seem to use ephemeral ports that the OS says are 
> fair game to any client, this will still end up being a problem. The above 
> proposed fix will make it so that the agent will release the socket after a 
> while, preventing the need to restart the agent after fixing the problem. But 
> it's still possible that the agent could bind to that port when making its 
> check.
> 
> 
> Diffs
> -----
> 
>   ambari-agent/src/main/python/ambari_agent/alerts/metric_alert.py 8b5f15d
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py 032310d
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_ha_namenode_health.py 058b7b2
>   ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py fb6c4c2
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py 8c72f4c
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py b297b0c
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_checkpoint_time.py 032310d
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_ha_namenode_health.py 058b7b2
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/WEBHCAT/package/files/alert_webhcat_server.py fb6c4c2
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/YARN/package/files/alert_nodemanager_health.py 8c72f4c
> 
> Diff: https://reviews.apache.org/r/33166/diff/
> 
> 
> Testing
> -------
> 
> I was able to force the alerts to use a specific client port (under Python 
> 2.7) - I chose 50070 since that's the port in this issue. I then verified 
> that when binding, the metric alerts did not let the port go until the agent 
> was restarted. After the fixes were applied, the agent was still able to bind 
> to 50070, but it did release it after the specified timeout.
> 
> 
> Thanks,
> 
> Jonathan Hurley
> 
>
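
For reference, the two ideas in the quoted description (a port inside the
ephemeral range can be handed out as a client source port, and a connect call
should carry an explicit timeout) might be sketched roughly as below. This is
only an illustration, not the code in the diff: the function names are
hypothetical, the 5 second default is simply a value inside the 5 to 10 second
window mentioned above, and the self-connect check is an extra assumption that
the review does not propose.

```python
import socket


def parse_port_range(text):
    """Parse the two integers stored in /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def in_ephemeral_range(port, port_range):
    """Return True if the OS may hand `port` out as a client source port."""
    low, high = port_range
    return low <= port <= high


def port_is_listening(host, port, timeout=5.0):
    """Hypothetical helper: TCP connect that gives up after `timeout` seconds.

    Passing a timeout avoids falling back to socket._GLOBAL_DEFAULT_TIMEOUT.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, socket.error):
        return False
    try:
        # Assumption, not part of the patch: treat identical local and remote
        # endpoints as a self-connect, i.e. nothing is really listening.
        return sock.getsockname() != sock.getpeername()
    finally:
        # Always release the socket so the port is not held open.
        sock.close()
```

On a host, the range could be read with
open('/proc/sys/net/ipv4/ip_local_port_range').read() and passed to
parse_port_range; with a sample value such as "32768 61000", port 50070 falls
inside it.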
