[ 
https://issues.apache.org/jira/browse/AMBARI-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357480#comment-14357480
 ] 

Hudson commented on AMBARI-10031:
---------------------------------

SUCCESS: Integrated in Ambari-trunk-Commit #2007 (See [https://builds.apache.org/job/Ambari-trunk-Commit/2007/])
AMBARI-10031. Ambari-agent died under SLES (and could not even restart automatically) (aonishuk) (aonishuk: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=1b22d34e51375e265fc125fda6b587438e02d185)
* ambari-common/src/main/python/resource_management/core/shell.py
* ambari-agent/src/test/python/resource_management/TestGroupResource.py
* ambari-agent/src/test/python/resource_management/TestUserResource.py


> Ambari-agent died under SLES (and could not even restart automatically)
> -----------------------------------------------------------------------
>
>                 Key: AMBARI-10031
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10031
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>             Fix For: 2.0.0
>
>
> I was performing an RU over the weekend and left the cluster running to finalize it later.
> The cluster ran unattended for 2 days, and ambari-agent died because it ran out
> of memory. Agents on the other nodes are running well.  
> The node has 8 GB of RAM, and memory does not look exhausted (unless the agent needs
> more than 1100 MB of RAM):
>     
>     
>     
>     dmitriusan-sles3-ru1-6:~ # free -m
>                  total       used       free     shared    buffers     cached
>     Mem:          7872       7077        795          0        134        222
>     -/+ buffers/cache:       6720       1151
>     Swap:            0          0          0
>     
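
The figures in the `free -m` output above can be cross-checked with a few lines of Python. This is only a sketch: `parse_free_m` is a hypothetical helper written for this note, not part of Ambari, and it assumes the older `free` layout shown in the report (with a `-/+ buffers/cache` row).

```python
def parse_free_m(text):
    """Parse the older `free -m` layout (with a "-/+ buffers/cache" row) into MB values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Mem:":
            keys = ("total", "used", "free", "shared", "buffers", "cached")
            result.update(zip(keys, (int(p) for p in parts[1:7])))
        elif parts[0] == "-/+":
            # Row layout: "-/+ buffers/cache: <used> <free>"
            result["used_minus_cache"] = int(parts[2])
            result["free_plus_cache"] = int(parts[3])
    return result


# Output copied from the report above.
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7872       7077        795          0        134        222
-/+ buffers/cache:       6720       1151
Swap:            0          0          0
"""

stats = parse_free_m(SAMPLE)
# Memory the kernel could hand out without swapping: free + buffers + cached.
reclaimable = stats["free"] + stats["buffers"] + stats["cached"]
print(reclaimable, stats["free_plus_cache"])
```

Both values agree, so roughly 1.1 GB was available on a node with no swap configured, while about 6.7 GB was held by processes.
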
> So I suspect a memory leak (probably due to status checks/jobs). Log files are
> attached.
>     
>     
>     
>     WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
>     WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
>     INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
>     INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
>     INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
>     INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
>     INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
>     INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
>     WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
>     ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
>     Traceback (most recent call last):
>       File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
>         task_id, override_output_files, handle = handle)
>       File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
>         process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
>       File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
>         stderr=tmperr, close_fds=close_fds, env=command_env)
>       File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
>         errread, errwrite)
>       File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
>         self.pid = os.fork()
>     OSError: [Errno 12] Cannot allocate memory
>     
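
The traceback above ends in `os.fork()` raising `OSError` with errno 12 (`ENOMEM`), which the kernel can return when it cannot reserve memory for the child of a large parent process. As a rough illustration of how a caller could tell this case apart from other launch failures, here is a minimal sketch. The `launch` function and its `popen` parameter are hypothetical names introduced for this note; they are not the agent's `PythonExecutor` code and not the fix that was committed.

```python
import errno
import subprocess
import sys


def launch(command, popen=subprocess.Popen):
    """Start `command`.

    Returns (process, None) on success, or (None, reason) when the fork
    fails with ENOMEM. Any other OSError is re-raised unchanged.
    `popen` is injectable only so the failure path can be exercised in tests.
    """
    try:
        proc = popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return proc, None
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            return None, "fork failed: %s" % exc.strerror
        raise


if __name__ == "__main__":
    proc, reason = launch([sys.executable, "-c", "print('ok')"])
    if proc is None:
        print(reason)
    else:
        out, _ = proc.communicate()
        print(out.decode().strip())
```

Reporting the condition explicitly does not free any memory; it only makes the failure easier to recognize in the log than a generic exception.
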
> Also, the agent could not restart automatically:
>     
>     
>     
>     INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
>     INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
>     INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
>     INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
>     ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
>     INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
>     INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
>     INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
>     INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
>     ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root      1421     1  0 06:07 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
>     
>     INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
>     
> A manual restart failed as well:
>     
>     
>     
>     ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
>     ====================
>     Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root     25597     1  0 05:59 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
>     ====================
>     Agent out at: /var/log/ambari-agent/ambari-agent.out
>     Agent log at: /var/log/ambari-agent/ambari-agent.log
>     
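
Both restart attempts above fail because port 8670 is still held by a leftover `sudo su ambari-qa ... beeline` process rather than by an agent. One plausible reading, which the report itself does not state, is that the child inherited the agent's listening socket. A quick way to check whether a port can currently be bound is sketched below. `port_is_free` is a hypothetical helper for this note, not Ambari's `PingPortListener` logic.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # No SO_REUSEADDR on purpose: an existing listener must make bind() fail.
        sock.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # 8670 is the ping port named in the log above.
    print(port_is_free(8670))
```

If the check returns `False` while no agent is running, the process listed in the error output is the one to stop before restarting the agent.
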



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
