Alex Piggott created AMBARI-12485:
-------------------------------------

             Summary: Ambari agent stopped reporting status until some file was deleted
                 Key: AMBARI-12485
                 URL: https://issues.apache.org/jira/browse/AMBARI-12485
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 2.0.0
         Environment: Centos6
            Reporter: Alex Piggott


1) I restarted YARN after making a config change, and observed that one of 
the 4 nodes of the cluster (call it db001) was not restarting any of its 
YARN components.

2) I restarted ambari-agent on db001 from the command line, at which point all 
services were still shown as down (red).

3) Note that I _was_ then able to restart the YARN components on db001

4) I found the following error message being generated every minute:

{code}
[root@db001 ~]# more /var/lib/ambari-agent/data/status_command_stderr.txt
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/ZOOKEEPER/3.4.5.2.0/package/scripts/zookeeper_client.py", line 67, in <module>
    ZookeeperClient().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 181, in execute
    self.load_structured_out()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 109, in load_structured_out
    Script.structuredOut = json.load(fp)
  File "/usr/lib64/python2.6/json/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
{code}

Files:
{code}
-rw-r--r-- 1 root root     0 Jul 21 16:50 status_command_stdout.txt
-rw------- 1 root root 18310 Jul 21 16:50 status_command.json
-rw-r--r-- 1 root root  1008 Jul 21 16:50 status_command_stderr.txt
{code}

I stuck some print statements in the Python (!!) and found that the file it 
was failing to parse was empty and had not been modified since Jul 19 (today 
is Jul 21):

{code}
[root@db001 data]# ls -l /var/lib/ambari-agent/data/structured-out-status.json
-rw-rw-rw- 1 root root 0 Jul 19 01:22 /var/lib/ambari-agent/data/structured-out-status.json
{code}
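
For anyone trying to reproduce this, a zero-byte file appears to be enough to 
trigger the same ValueError as in the traceback above. This is only a minimal 
sketch: the helper name load_json_file and the temporary file are my own 
stand-ins, not anything taken from the Ambari code.

```python
import json
import os
import tempfile


def load_json_file(path):
    """Hypothetical helper mimicking the failing call: json.load on an open file."""
    with open(path) as fp:
        return json.load(fp)


# Create a zero-byte file standing in for structured-out-status.json.
fd, empty_path = tempfile.mkstemp(suffix=".json")
os.close(fd)

try:
    load_json_file(empty_path)
    reproduced = False
except ValueError:
    # Python 2.6 reports "No JSON object could be decoded"; on Python 3 the
    # error is json.JSONDecodeError, which is a subclass of ValueError.
    reproduced = True
finally:
    os.remove(empty_path)

print("empty file raises ValueError: %s" % reproduced)
```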

Upon deleting that, the error messages went away, and Ambari showed all 
components as green again.
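
Since deleting the empty file was enough to recover, one possible way to 
guard against this would be to treat a missing, empty, or unparseable file as 
"no structured output yet". The sketch below is hypothetical: the function 
name load_structured_out_tolerant is mine, it is a standalone function rather 
than the method in script.py, and I am assuming an empty dict is a reasonable 
fallback.

```python
import json
import os
import tempfile


def load_structured_out_tolerant(path):
    """Return the parsed JSON from path, or {} if it is missing, empty, or corrupt.

    Hypothetical helper for illustration; not the Ambari implementation.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return {}
    try:
        with open(path) as fp:
            return json.load(fp)
    except ValueError:
        # Truncated or otherwise invalid JSON: fall back to an empty dict.
        return {}


# Usage: a zero-byte file no longer raises, it just yields {}.
fd, demo_path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    result = load_structured_out_tolerant(demo_path)
finally:
    os.remove(demo_path)
```

The trade-off is that a corrupt file is silently ignored, so logging a 
warning in the fallback branches would probably be wise in practice.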

Note that nobody had touched the cluster since July 14.

Hope this report is of some use!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
