-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34470/
-----------------------------------------------------------
Review request for Ambari, Dmitro Lisnichenko and Myroslav Papirkovskyy.
Bugs: AMBARI-8768
https://issues.apache.org/jira/browse/AMBARI-8768
Repository: ambari
Description
-------
The Ambari agent is susceptible to hanging when the 'df' command blocks. This
causes loss of heartbeat and manageability. I've seen this happen when the NFS
gateway's HDFS mount point blocks because HDFS isn't available (we had set the
NFS soft option on the mount point, but then realized that wasn't a good idea,
as not everyone's processes and scripts will handle failure gracefully and
retry properly).
Restarting the agent also leaves the df process bound to port 8670, so that
process has to be killed manually before the Ambari agent can restart and bind
successfully. Even then, you'll see a hang at this point after connecting to
the 8440 ca, and the agent never fully initializes, so the heartbeat still
never comes back.
The df command should either run in a separate thread, so that it cannot block
the main heartbeat and management functions, or have a timeout set on the
command execution to prevent this issue.
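To illustrate the timeout option, here is a minimal sketch in Python 3. It is
not the change made in this review: the function name run_with_timeout, the
10-second value, and the 'df -kP' flags in the usage comment are all
assumptions chosen for the example.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its stdout as text, or None on timeout or failure.

    Hypothetical helper for illustration only, not the agent's actual code.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        out, _err = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The command is taking too long (for example, df stuck on a hung
        # mount). Ask the kernel to kill it, then wait only briefly so the
        # caller is never blocked for long even if the child is slow to die.
        proc.kill()
        try:
            proc.communicate(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        return None
    if proc.returncode != 0:
        return None
    return out


# Assumed usage: collect mount information, giving up after 10 seconds.
# output = run_with_timeout(["df", "-kP"], 10)
# if output is None:
#     ...skip disk info for this heartbeat instead of blocking...
```

With something along these lines, a blocked df costs the caller at most the
timeout plus the short grace period, rather than stalling the heartbeat
indefinitely.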
Diffs
-----
ambari-agent/src/main/python/ambari_agent/Hardware.py 439803d
ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java
c2cf2c0
ambari-server/src/test/java/org/apache/ambari/server/agent/TestHeartbeatHandler.java
2b1c355
Diff: https://reviews.apache.org/r/34470/diff/
Testing
-------
unit tests passed
Thanks,
Dmytro Sen