[
https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552893#comment-14552893
]
Hudson commented on AMBARI-8768:
--------------------------------
FAILURE: Integrated in Ambari-trunk-Commit #2660 (See
[https://builds.apache.org/job/Ambari-trunk-Commit/2660/])
AMBARI-8768 Ambari agent Heartbeat lost when df hangs (NFS gateway), also
prevents proper re-initialization of agent upon restart (additional patch)
(dsen) (dsen:
http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=832f3b9c3dc64997e1a5dbccc585d9acb3e3591c)
* ambari-agent/src/main/python/ambari_agent/Hardware.py
> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper
> re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: AMBARI-8768
> URL: https://issues.apache.org/jira/browse/AMBARI-8768
> Project: Ambari
> Issue Type: Bug
> Components: ambari-agent
> Affects Versions: 1.7.0
> Environment: HDP 2.1
> Reporter: Hari Sekhon
> Assignee: Dmytro Sen
> Attachments: AMBARI-8768.patch
>
>
> Ambari agent is susceptible to hanging when the 'df' command blocks. This
> causes loss of heartbeat and manageability. I've found this has happened with
> NFS gateway's HDFS mount point blocking when HDFS isn't available (we had set
> the NFS soft option on the mount point but then realized that wasn't a good
> idea as not everyone's processes and scripts will handle failure gracefully
> and retry properly).
> Restarting the agent also leaves the df process bound to port 8670, which
> must be killed manually before the ambari agent can restart and bind
> successfully. Even then, the agent hangs after connecting to the 8440 ca and
> never fully initializes, so the heartbeat still never comes back.
> The df command should either run in a separate thread, so that it cannot block
> the main heartbeat and management functions, or have a timeout set on its
> execution to prevent this issue.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
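The timeout option suggested in the description can be sketched roughly as
below. This is not the code from the committed patch to Hardware.py; the helper
name run_with_timeout and the timeout values are hypothetical, and the sketch
assumes Python 3's subprocess.run rather than whatever the agent itself uses.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run `command` and return (stdout, timed_out).

    Hypothetical helper for illustration only; it is not taken from
    the AMBARI-8768 patch.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so the
        # caller only has to report that no output is available.
        return None, True
    return completed.stdout, False


# Example call (the 5-second limit is an assumed value):
#   output, timed_out = run_with_timeout(["df", "-P"], 5)
#   if timed_out:
#       ...skip the disk report for this heartbeat...
```

One caveat: a process blocked in uninterruptible I/O on a hung mount may not
exit promptly even when killed, so running the call in a separate thread, as
the description also suggests, may still be needed to keep the heartbeat loop
responsive.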