[
https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yusaku Sako updated AMBARI-8768:
--------------------------------
Fix Version/s: 2.1.0
> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper
> re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: AMBARI-8768
> URL: https://issues.apache.org/jira/browse/AMBARI-8768
> Project: Ambari
> Issue Type: Bug
> Components: ambari-agent
> Affects Versions: 1.7.0
> Environment: HDP 2.1
> Reporter: Hari Sekhon
> Assignee: Dmytro Sen
> Fix For: 2.1.0
>
> Attachments: AMBARI-8768.patch
>
>
> Ambari agent is susceptible to hanging when the 'df' command blocks. This
> causes loss of heartbeat and manageability. I've found this happens when the
> NFS gateway's HDFS mount point blocks because HDFS isn't available (we had set
> the NFS soft option on the mount point but then realized that wasn't a good
> idea as not everyone's processes and scripts will handle failure gracefully
> and retry properly).
> Restarting the agent also leaves the df process bound to port 8670, which
> has to be killed manually before the ambari agent can restart and bind
> successfully. Even then, the agent hangs after connecting to the CA on port
> 8440 and never fully initializes, so the heartbeat still never comes back.
> The df command should either run in a separate thread, so that it does not
> block the main heartbeat and management functions, or have a timeout set on
> its execution to prevent this issue.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
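
The timeout option suggested in the description could look roughly like the
sketch below. It is not the attached AMBARI-8768.patch and not the agent's
actual code: the names run_with_timeout and df_lines are hypothetical, the
5 second default is an arbitrary assumption, and it is written for Python 3.7+
rather than the Python version the 1.7.0 agent runs on.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout).

    Returns (None, "") if the command does not finish within timeout_s.
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        proc.kill()
        # A process blocked on a hung NFS mount may not exit promptly even
        # after SIGKILL, so only wait briefly for it instead of forever.
        try:
            proc.communicate(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        return None, ""


def df_lines(timeout_s=5.0):
    """Return the output lines of `df -kP`, or None on timeout or failure."""
    rc, out = run_with_timeout(["df", "-kP"], timeout_s)
    return out.splitlines() if rc == 0 else None
```

A caller on the heartbeat path would treat a None result from df_lines() as
"disk info unavailable" and carry on, rather than blocking. Popen is used
directly instead of subprocess.run because run waits for the child after
killing it, which could itself block if df is stuck in uninterruptible I/O.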
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)