[ 
https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yusaku Sako updated AMBARI-8768:
--------------------------------
    Fix Version/s: 2.1.0

> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper 
> re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-8768
>                 URL: https://issues.apache.org/jira/browse/AMBARI-8768
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>         Environment: HDP 2.1
>            Reporter: Hari Sekhon
>            Assignee: Dmytro Sen
>             Fix For: 2.1.0
>
>         Attachments: AMBARI-8768.patch
>
>
> Ambari agent is susceptible to hanging when the 'df' command blocks. This 
> causes loss of heartbeat and manageability. I've found this has happened with 
> NFS gateway's HDFS mount point blocking when HDFS isn't available (we had set 
> the NFS soft option on the mount point but then realized that wasn't a good 
> idea as not everyone's processes and scripts will handle failure gracefully 
> and retry properly).
> When restarting the agent it also leaves the df process bound to port 8670 
> which requires manually killing that in order to get the ambari agent to 
> restart and bind successfully, but even then you'll see a hang at this point 
> after connecting to the 8440 ca and the agent never fully initializes so the 
> heartbeat still never comes back.
> The df command should be either in another thread non-blocking the main 
> heartbeat and management functions or should have a timeout set on the 
> command execution to prevent this issue.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
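
The timeout option suggested in the description could look roughly like the
sketch below. It is only an illustration, not the code in AMBARI-8768.patch:
the helper name run_with_timeout and the timeout value are hypothetical, and
it assumes Python 3.7+ rather than the interpreter the 1.7.0 agent shipped
with.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout as text, or None if it does not
    finish within timeout_s seconds. (Hypothetical helper, not Ambari code.)"""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible I/O (for
        # example on a hard NFS mount) may not exit until the mount recovers,
        # so we deliberately do not wait for it here.
        proc.kill()
        proc.stdout.close()
        return None
    return out


# Example call (assumed flags and timeout): the caller treats None as
# "disk info unavailable" and carries on with the heartbeat.
# disk_info = run_with_timeout(["df", "-P"], 10)
```

Killing without waiting keeps the caller from blocking on a child that cannot
be reaped yet; the trade-off is that such a child may linger until the mount
becomes responsive again.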



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
