-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/
-----------------------------------------------------------
Review request for Ambari, Alejandro Fernandez and Sumit Mohanty.
Bugs: AMBARI-10464
https://issues.apache.org/jira/browse/AMBARI-10464
Repository: ambari
Description
-------
The Ambari Agent process appears to be listening on port 50070 and holding it
open. This is causing the NN to fail to start until the Ambari Agent is
restarted. A netstat -natp reveals that the agent process has this port open.
```
[root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
```
After digging some more through sockets and Linux, I think it's entirely
possible for the agent to be assigned a source port that matches the
destination port; anything in the ephemeral port range is up for grabs.
Essentially what is happening here is that the NN is down, and when the agent
tries to check it via a socket connection to 50070, the source (client) side of
the connection binds to 50070 since it's open and within the range specified by
/proc/sys/net/ipv4/ip_local_port_range.
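Here's a rough demonstration of that self-connect on Linux; the explicit bind()
just stands in for the kernel happening to pick 50070 as the ephemeral source
port, and loopback is used so it runs anywhere:
```
import socket

# Nothing is listening on 50070 (the NN is down). Forcing the client socket
# onto that same port makes connect() succeed against itself (a TCP
# self-connect), and netstat then shows src == dst == 50070, ESTABLISHED.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 50070))       # stands in for an unlucky ephemeral-port pick
s.connect(('127.0.0.1', 50070))    # succeeds even though nothing is listening
print(s.getsockname(), s.getpeername())  # both report ('127.0.0.1', 50070)
```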
The client essentially connects to itself. The WEB alert connection timeout is
set to 10 seconds, which means that after 10 seconds it will release the
connection automatically. The METRIC alerts, however, use a slightly different
mechanism for opening the socket and don't specify a socket timeout. For a
METRIC alert, when the source and destination ports are the same, it will
connect and hold that connection for as long as socket._GLOBAL_DEFAULT_TIMEOUT,
which could be a very long time.
I believe that we need to change the METRIC alert to pass a timeout value to
the socket (between 5 and 10 seconds, just like the WEB alerts).
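A minimal sketch of that change, assuming the METRIC alert fetches its JMX data
via urllib2 (the actual call site is in metric_alert.py in the diff below):
```
import urllib2

# Pass an explicit timeout so a self-connected socket is torn down after a
# few seconds instead of being held for socket._GLOBAL_DEFAULT_TIMEOUT.
CONNECTION_TIMEOUT = 5.0  # seconds, same 5-10 second range as the WEB alerts

def fetch_jmx(url):
    response = urllib2.urlopen(url, timeout=CONNECTION_TIMEOUT)
    try:
        return response.read()
    finally:
        response.close()
```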
Since the Hadoop components seem to use ephemeral ports that the OS considers
fair game for any client, this can still be a problem. The proposed fix above
will make the agent release the socket after the timeout, which prevents the
need to restart the agent after fixing the problem, but it's still possible for
the agent to bind to that port when making its check.
Diffs
-----
ambari-agent/src/main/python/ambari_agent/alerts/metric_alert.py 8b5f15d
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py 032310d
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_ha_namenode_health.py 058b7b2
ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py fb6c4c2
ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py 8c72f4c
ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py b297b0c
ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_checkpoint_time.py 032310d
ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_ha_namenode_health.py 058b7b2
ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/WEBHCAT/package/files/alert_webhcat_server.py fb6c4c2
ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/YARN/package/files/alert_nodemanager_health.py 8c72f4c
Diff: https://reviews.apache.org/r/33166/diff/
Testing
-------
I was able to force the alerts to use a specific client port (under Python
2.7); I chose 50070 since that's the port in this issue. I then verified that,
once bound, the metric alerts did not release the port until the agent was
restarted. After the fixes were applied, the agent could still bind to 50070,
but it released the port after the specified timeout. A sketch of how the
client port was forced is below.
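Roughly, the client port was pinned via httplib's source_address (available in
Python 2.7); the loopback address here stands in for the NN host from the
netstat output above:
```
import httplib

# Pin the client-side port to 50070 to reproduce the self-connect on demand.
conn = httplib.HTTPConnection('127.0.0.1', 50070,
                              source_address=('127.0.0.1', 50070))
conn.connect()  # succeeds even with the NN down: the socket connects to itself
print(conn.sock.getsockname(), conn.sock.getpeername())
```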
Thanks,
Jonathan Hurley