-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34859/
-----------------------------------------------------------

Review request for Ambari, Andrew Onischuk, Emil Anca, and Jonathan Hurley.


Bugs: AMBARI-11570
    https://issues.apache.org/jira/browse/AMBARI-11570


Repository: ambari


Description
-------

If a cluster has been up for several days, Ambari complains that one or more of 
the agents have stopped heartbeating. This has been observed on Kerberized 
clusters, but may also occur on non-Kerberized clusters (not tested).

Looking at the ambari-agent log, it appears there is an _open file_ leak:

*/var/log/ambari-agent/ambari-agent.log*
```
INFO 2015-05-28 11:43:13,547 Controller.py:244 - Heartbeat response received 
(id = 5382)
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service ZOOKEEPER of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service AMBARI_METRICS of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service HDFS of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service YARN of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service HDFS of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service MAPREDUCE2 of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service YARN of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service TEZ of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service HIVE of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service HIVE of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service PIG of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service ZOOKEEPER of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service AMBARI_METRICS of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for 
service KERBEROS of cluster BUG36704 to the queue.
INFO 2015-05-28 11:43:23,549 Heartbeat.py:78 - Building Heartbeat: {responseId 
= 5382, timestamp = 1432813403549, commandsInProgress = False, componentsMapped 
= True}
ERROR 2015-05-28 11:43:23,553 Controller.py:330 - Connection to 
levas-36704-1.c.pramod-thangali.internal was lost (details=[Errno 24] Too many 
open files: '/sys/kernel/mm/redhat_transparent_hugepage/enabled')
INFO 2015-05-28 11:43:34,555 NetUtil.py:59 - Connecting to 
https://levas-36704-1.c.pramod-thangali.internal:8440/connection_info
INFO 2015-05-28 11:43:34,627 security.py:93 - SSL Connect being called.. 
connecting to the server
INFO 2015-05-28 11:43:34,696 security.py:55 - SSL connection established. 
Two-way SSL authentication is turned off on the server.
INFO 2015-05-28 11:43:34,897 Controller.py:244 - Heartbeat response received 
(id = 5382)
ERROR 2015-05-28 11:43:34,897 Controller.py:262 - Error in responseId sequence 
- restarting
WARNING 2015-05-28 11:43:42,860 base_alert.py:140 - 
[Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:43:42,873 base_alert.py:140 - 
[Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too 
many open files
WARNING 2015-05-28 11:44:00,537 base_alert.py:140 - 
[Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:44:42,860 base_alert.py:140 - 
[Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:44:42,880 base_alert.py:140 - 
[Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too 
many open files
WARNING 2015-05-28 11:44:43,002 base_alert.py:140 - [Alert][datanode_storage] 
Unable to execute alert. [Errno 24] Too many open files
WARNING 2015-05-28 11:45:00,549 base_alert.py:140 - 
[Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:45:42,860 base_alert.py:140 - 
[Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:45:42,873 base_alert.py:140 - 
[Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too 
many open files
WARNING 2015-05-28 11:46:00,537 base_alert.py:140 - 
[Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:46:42,861 base_alert.py:140 - 
[Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:46:42,863 base_alert.py:140 - 
[Alert][ams_metrics_collector_hbase_master_cpu] Unable to execute alert. 
[Alert][ams_metrics_collector_hbase_master_cpu] Unable to get json from jmx 
response!
WARNING 2015-05-28 11:46:42,892 base_alert.py:140 - 
[Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too 
many open files
WARNING 2015-05-28 11:46:42,899 base_alert.py:140 - [Alert][datanode_storage] 
Unable to execute alert. [Errno 24] Too many open files
WARNING 2015-05-28 11:47:00,539 base_alert.py:140 - 
[Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:47:42,861 base_alert.py:140 - 
[Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many 
open files
WARNING 2015-05-28 11:47:42,873 base_alert.py:140 - 
[Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too 
many open files
WARNING 2015-05-28 11:48:00,541 base_alert.py:140 - 
[Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many 
open files
...
```
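
The pattern behind the leak is sketched below. This is a minimal illustration, not the actual patch to security_commons.py: the function names and the choice of file are made up for the example. The point is that descriptors opened without a matching close (or a `with` block) accumulate across heartbeat/alert cycles until the process hits its `nofile` limit and fails with `[Errno 24]`.

```
# Minimal sketch of the leak pattern, not the actual patch; the function
# names and the choice of file below are illustrative only.

def read_setting_leaky(path):
    # Relies on garbage collection to close the descriptor. If the file
    # object is kept alive (or the interpreter does not use reference
    # counting), descriptors pile up until "[Errno 24] Too many open files".
    return open(path).read().strip()

def read_setting_safe(path):
    # The context manager closes the descriptor even if read() raises.
    with open(path) as f:
        return f.read().strip()

if __name__ == '__main__':
    print(read_setting_safe('/sys/kernel/mm/redhat_transparent_hugepage/enabled'))
```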

Restarting the ambari-agent reconnects it to the server and the cluster returns 
to a healthy state.


Diffs
-----

  ambari-agent/src/test/python/resource_management/TestSecurityCommons.py ead0351 
  ambari-common/src/main/python/resource_management/libraries/functions/security_commons.py 688eba7 

Diff: https://reviews.apache.org/r/34859/diff/


Testing
-------

Manually tested and viewed `lsof` output to make sure previously offending open 
files were no longer left open.
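
For anyone repeating the verification, a rough way to watch the agent's descriptor count over time is sketched below; it assumes Linux `/proc` and the default ambari-agent pid file location (both are assumptions, adjust for your installation).

```
# Rough check of the agent's open-descriptor count; assumes Linux /proc and
# the default ambari-agent pid file location (both are assumptions here).
import os

def open_fd_count(pid):
    # Every entry under /proc/<pid>/fd is one open descriptor.
    return len(os.listdir('/proc/%d/fd' % pid))

with open('/var/run/ambari-agent/ambari-agent.pid') as pid_file:
    agent_pid = int(pid_file.read().strip())

print('ambari-agent open descriptors: %d' % open_fd_count(agent_pid))
```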


Thanks,

Robert Levas
