Hari Sekhon created AMBARI-10757:
------------------------------------

             Summary: MapReduce History Server curl call gets stuck, agent 
restarts fail on "Address already in use" error
                 Key: AMBARI-10757
                 URL: https://issues.apache.org/jira/browse/AMBARI-10757
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 2.0.0
         Environment: HDP 2.2
            Reporter: Hari Sekhon
            Priority: Minor


The curl call to the MapReduce History Server gets stuck, which appears to 
block the ambari-agent (the typical symptom of no health check report in 3 
minutes in the Ambari UI). Restarting ambari-agent then fails with the usual 
"Address already in use" error:
{code}# ps -ef|grep ambari-agent
root     17616 14155  0 10:27 pts/11   00:00:00 curl --negotiate -u   -b 
/var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -c 
/var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -sL 
-w %{http_code} http://host:19888 --connect-timeout 10 -o /dev/null
root     17677 12202  0 10:28 pts/11   00:00:00 grep ambari-agent
# date
Mon Apr 27 10:28:21 BST 2015
...
# date
Mon Apr 27 10:29:11 BST 2015
# ps -ef|grep ambari-agent
root     17616 14155  0 10:27 pts/11   00:00:00 curl --negotiate -u   -b 
/var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -c 
/var/lib/ambari-agent/data/tmp/cookies/3a05acb6-5d0c-4b6a-9304-91af19ae4efa -sL 
-w %{http_code} http://host:19888 --connect-timeout 10 -o /dev/null
{code}
Although a 10 second timeout is passed to curl itself
{code}... --connect-timeout 10 ...{code}
the man page says this applies only to connection initiation. If the connection 
hangs after it has been established, I believe this option would not help, and 
that must be what is happening in this case.
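
To illustrate the difference between the two limits, here is a minimal sketch in Python (the agent's language). The helper name build_curl_command and the 30 second total limit are illustrative assumptions, not Ambari code. curl's --max-time option caps the whole operation, while --connect-timeout covers only the connection phase. The Kerberos and cookie options from the ps output are left out for brevity.

```python
def build_curl_command(url, connect_timeout=10, max_time=30):
    """Return a curl argv that bounds both the connect phase and the total time.

    Hypothetical helper for illustration; the default values are assumptions.
    """
    return [
        "curl", "-sL",
        "-w", "%{http_code}",
        "-o", "/dev/null",
        # Limits only the time allowed to establish the connection.
        "--connect-timeout", str(connect_timeout),
        # Limits the whole operation, including a transfer that stalls
        # after the connection has been established.
        "--max-time", str(max_time),
        url,
    ]


if __name__ == "__main__":
    print(" ".join(build_curl_command("http://host:19888")))
```

With both options set, a request that connects but then stalls should be abandoned by curl itself once the total limit is reached.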

After killing the curl call, a stuck 'df' command was still holding the port, 
as described in AMBARI-8768. Killing that finally freed the port and allowed 
the Ambari agent restart to succeed and heartbeat back to the Ambari server.

This is related to AMBARI-8768 in that it is basically the same type of 
problem: not having hard timeouts in code on all commands. It is also similar 
to AMBARI-10495 and AMBARI-9197, all of which come down to not having generic 
timeouts applied in Ambari.

There should be a general change made to Ambari to time out all arbitrary 
commands and actions after a reasonably long period, configurable at the time 
of each command call in the code.
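
A rough sketch of what such a per-call timeout could look like, assuming Python 3 on a POSIX system. The function name run_with_timeout, its return shape and the grace period are hypothetical and not taken from the Ambari code base. The command runs in its own process group so that any children it spawns are killed along with it.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(argv, timeout, grace=5):
    """Run argv and kill its process group if it exceeds `timeout` seconds.

    Returns (returncode, output, timed_out). Hypothetical helper; returncode
    is None if the process has still not exited `grace` seconds after SIGKILL.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out, False
    except subprocess.TimeoutExpired:
        # Kill the whole group, not just the direct child.
        os.killpg(proc.pid, signal.SIGKILL)
        try:
            out, _ = proc.communicate(timeout=grace)
        except subprocess.TimeoutExpired:
            out = b""  # the process did not exit; give up waiting for it
        return proc.returncode, out, True


if __name__ == "__main__":
    # A command that sleeps far longer than the 1 second limit.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, timeout=1))
```

Each caller passes its own timeout, which matches the per-call configurability suggested above. One caveat: a process blocked in uninterruptible I/O may not exit even on SIGKILL, which is why the sketch stops waiting after the grace period instead of blocking the caller.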

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
