[ 
https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552893#comment-14552893
 ] 

Hudson commented on AMBARI-8768:
--------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #2660 (See 
[https://builds.apache.org/job/Ambari-trunk-Commit/2660/])
AMBARI-8768 Ambari agent Heartbeat lost when df hangs (NFS gateway), also 
prevents proper re-initialization of agent upon restart (additional patch) 
(dsen) (dsen: 
http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=832f3b9c3dc64997e1a5dbccc585d9acb3e3591c)
* ambari-agent/src/main/python/ambari_agent/Hardware.py


> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper 
> re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-8768
>                 URL: https://issues.apache.org/jira/browse/AMBARI-8768
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>         Environment: HDP 2.1
>            Reporter: Hari Sekhon
>            Assignee: Dmytro Sen
>         Attachments: AMBARI-8768.patch
>
>
> Ambari agent is susceptible to hanging when the 'df' command blocks. This 
> causes loss of heartbeat and manageability. I've found this has happened with 
> NFS gateway's HDFS mount point blocking when HDFS isn't available (we had set 
> the NFS soft option on the mount point but then realized that wasn't a good 
> idea as not everyone's processes and scripts will handle failure gracefully 
> and retry properly).
> When restarting the agent it also leaves the df process bound to port 8670 
> which requires manually killing that in order to get the ambari agent to 
> restart and bind successfully, but even then you'll see a hang at this point 
> after connecting to the 8440 ca and the agent never fully initializes so the 
> heartbeat still never comes back.
> The df command should be either in another thread non-blocking the main 
> heartbeat and management functions or should have a timeout set on the 
> command execution to prevent this issue.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
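
For illustration only, the timeout approach suggested in the description could
look roughly like the sketch below. This is not the committed patch to
Hardware.py: the helper name run_with_timeout, the 5 second default and the
"df -P" invocation are all assumptions made for the example, and it is written
for Python 3.7+ rather than whatever interpreter the agent ships with.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return its stdout as text, or None if it does not
    finish within timeout_s seconds.

    Hypothetical helper for illustration; not part of ambari-agent.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do not wait for it here:
        # a process blocked on an unresponsive mount may not exit promptly,
        # and waiting would reintroduce the hang this helper tries to avoid.
        proc.kill()
        proc.stdout.close()
        return None
    return out


if __name__ == "__main__":
    # Assumed invocation; the agent's real df arguments may differ.
    output = run_with_timeout(["df", "-P"], timeout_s=5.0)
    if output is None:
        print("df did not return in time; skipping disk info for this cycle")
    else:
        print(output)
```

The caller treats None as "no disk information this cycle" so the heartbeat
can carry on. Not waiting on the killed child can leave it behind until it
unblocks, which is a trade-off of this sketch rather than a recommendation.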



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
