Alex Piggott created AMBARI-12485:
-------------------------------------
Summary: Ambari agent stopped reporting status until some file was deleted
Key: AMBARI-12485
URL: https://issues.apache.org/jira/browse/AMBARI-12485
Project: Ambari
Issue Type: Bug
Components: ambari-agent
Affects Versions: 2.0.0
Environment: Centos6
Reporter: Alex Piggott
1) I restarted YARN after making a config change, and observed that one of
the 4 nodes of the cluster (call it db001) was not restarting any of its YARN components.
2) I restarted ambari-agent on db001 from the command line, after which all
services were still shown as down (red)
3) Note that I _was_ then able to restart the YARN components on db001
4) I found the following error message being generated every minute:
{code}
[root@db001 ~]# more /var/lib/ambari-agent/data/status_command_stderr.txt
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/ZOOKEEPER/3.4.5.2.0/package/scripts/zookeeper_client.py", line 67, in <module>
ZookeeperClient().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 181, in execute
self.load_structured_out()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 109, in load_structured_out
Script.structuredOut = json.load(fp)
File "/usr/lib64/python2.6/json/__init__.py", line 267, in load
parse_constant=parse_constant, **kw)
File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
{code}
Files:
{code}
-rw-r--r-- 1 root root 0 Jul 21 16:50 status_command_stdout.txt
-rw------- 1 root root 18310 Jul 21 16:50 status_command.json
-rw-r--r-- 1 root root 1008 Jul 21 16:50 status_command_stderr.txt
{code}
I stuck some print statements in the Python script (!!) and found that the file
it failed to parse was an empty file, not modified since Jul 19 (today == Jul 21):
{code}
[root@db001 data]# ls -l /var/lib/ambari-agent/data/structured-out-status.json
-rw-rw-rw- 1 root root 0 Jul 19 01:22 /var/lib/ambari-agent/data/structured-out-status.json
{code}
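The traceback is consistent with json.load being handed this zero-byte file: an empty string is not valid JSON, so the decoder raises ValueError. As a rough sketch only (this is not the Ambari code or a proposed patch, and load_structured_out_tolerant is a made-up name), a loader could treat a missing, empty, or unparseable file as an empty dict instead of failing every status command:

```python
import json
import os
import tempfile


def load_structured_out_tolerant(path):
    """Return the parsed JSON from path, or {} if the file is missing, empty, or invalid.

    Hypothetical helper for illustration; not part of resource_management.
    """
    try:
        with open(path) as fp:
            return json.load(fp)
    except (IOError, OSError, ValueError):
        # IOError/OSError: the file is missing or unreadable.
        # ValueError: the content is empty or malformed JSON.
        return {}


if __name__ == "__main__":
    # Reproduce the reported condition with a zero-byte temporary file.
    fd, empty_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        print(load_structured_out_tolerant(empty_path))
    finally:
        os.remove(empty_path)
```

Whether silently falling back to an empty dict is the right behaviour for the agent is a separate question; logging the bad file name would at least have made the cause visible without print statements.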
Upon deleting that, the error messages went away, and Ambari showed all
components as green again.
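To check other nodes for the same condition, something like the following could list zero-byte .json files in the agent data directory. This is only a sketch: the directory is the one shown in the paths above, and find_empty_json is an illustrative name, not an Ambari utility.

```python
import os


def find_empty_json(directory):
    """Return sorted paths of zero-byte *.json files directly inside directory."""
    empty = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Only regular files with a .json suffix and a size of 0 bytes qualify.
        if name.endswith(".json") and os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    data_dir = "/var/lib/ambari-agent/data"  # path as shown in this report
    if os.path.isdir(data_dir):
        for path in find_empty_json(data_dir):
            print(path)
```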
Note that nobody had touched the cluster since July 14.
Hope this report is of some use!