-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33166/#review80059
-----------------------------------------------------------

Where do we specify what ports ambari-agent is allowed to listen on? Trying to make Hadoop and ambari-agent use disjoint sets of ports is one option, although difficult to enforce, but this is something else we could do to avoid port collisions.

- Alejandro Fernandez


On April 14, 2015, 2:09 p.m., Jonathan Hurley wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33166/
> -----------------------------------------------------------
>
> (Updated April 14, 2015, 2:09 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez and Sumit Mohanty.
>
>
> Bugs: AMBARI-10464
>     https://issues.apache.org/jira/browse/AMBARI-10464
>
>
> Repository: ambari
>
>
> Description
> -------
>
> The Ambari Agent process appears to be listening on port 50070 and holding it
> open. This is causing the NN to fail to start until the Ambari Agent is
> restarted. A netstat -natp reveals that the agent process has this port open.
> ```
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 192.168.1.141:50070 192.168.1.141:50070 ESTABLISHED 1630/python2.6
> ```
>
> After digging some more through sockets and Linux, I think it's entirely
> possible that the agent could be assigned a source port that matches the
> destination port. Anything in the ephemeral port range is up for grabs.
> Essentially what is happening here is that NN is down and when the agent
> tries to check it via a socket connection to 50070, the source (client) side
> of the socket connection binds to 50070 since it's open and within the range
> specified by /proc/sys/net/ipv4/ip_local_port_range.
>
> The client essentially connects to itself; the WEB alert connection timeout
> is set to 10 seconds. That means that after 10 seconds, it will release the
> connection automatically. The METRIC alerts, however, use a slightly
> different mechanism of opening the socket and don't specify the socket
> timeout.
> For a METRIC alert, when both the source and destination ports are
> the same, it will connect and hold that connection for as long as
> socket._GLOBAL_DEFAULT_TIMEOUT allows, which could be a very long time.
>
> I believe that we need to change METRIC alert to pass in a timeout value to
> the socket (between 5 and 10 seconds just like WEB alerts).
>
> Since the Hadoop components seem to use ephemeral ports that the OS says are
> fair game to any client, this will still end up being a problem. The above
> proposed fix will make it so that the agent will release the socket after a
> while, preventing the need to restart the agent after fixing the problem. But
> it's still possible that the agent could bind to that port when making its
> check.
>
>
> Diffs
> -----
>
>   ambari-agent/src/main/python/ambari_agent/alerts/metric_alert.py 8b5f15d
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py 032310d
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_ha_namenode_health.py 058b7b2
>   ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py fb6c4c2
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py 8c72f4c
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanagers_summary.py b297b0c
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_checkpoint_time.py 032310d
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/HDFS/package/files/alert_ha_namenode_health.py 058b7b2
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/WEBHCAT/package/files/alert_webhcat_server.py fb6c4c2
>   ambari-server/src/main/resources/stacks/BIGTOP/0.8/services/YARN/package/files/alert_nodemanager_health.py 8c72f4c
>
> Diff: https://reviews.apache.org/r/33166/diff/
>
>
> Testing
> -------
>
> I was able to force the alerts to use a specific client port (under Python
> 2.7) - I chose 50070 since that's the port in this issue. I then verified
> that when binding, the metric alerts did not let the port go until the agent
> was restarted. After the fixes were applied, the agent was still able to bind
> to 50070, but it did release it after the specified timeout.
>
>
> Thanks,
>
> Jonathan Hurley
>
>
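
For reference, a minimal Python sketch of the two ideas in the quoted description: checking whether a service port falls inside the range from /proc/sys/net/ipv4/ip_local_port_range, and opening the check socket with an explicit timeout. This is not the code from metric_alert.py. The function names (parse_port_range, in_ephemeral_range, check_port) are hypothetical, and the getsockname()/getpeername() comparison used to spot a self-connection is an extra assumption, not part of the proposed patch.

```python
import socket


def parse_port_range(text):
    """Parse the 'low high' contents of ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def in_ephemeral_range(port, port_range):
    """True if the OS may hand this port out as a client source port."""
    low, high = port_range
    return low <= port <= high


def check_port(host, port, timeout=5.0):
    """Try a TCP connection with an explicit timeout.

    Returns False when the connection fails or times out, and also when
    the socket has connected to itself (local and peer address identical),
    which is the situation described in the review request.
    """
    try:
        # Passing timeout here avoids falling back to the global default,
        # which is "no timeout" unless socket.setdefaulttimeout() was called.
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        return False
    try:
        return sock.getsockname() != sock.getpeername()
    finally:
        # Always release the source port, whatever the outcome.
        sock.close()
```

On a Linux host this could be used roughly as follows: read /proc/sys/net/ipv4/ip_local_port_range, pass its contents to parse_port_range, and call in_ephemeral_range(50070, ...) to see whether the NN port can be handed out as a source port. check_port("192.168.1.141", 50070, timeout=5.0) would then release the socket after at most the timeout instead of holding it until the agent is restarted.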
