Is it possible that the FQDN/hostname of the agent hosts has changed? For example, the agents may have initially registered themselves as host A (you can check this with the API at server:8080/api/v1/clusters/<cluster name>/hosts), and after the network reconfiguration they started sending their heartbeats as host B (server:8080/api/v1/hosts will tell you about the hosts that have registered).
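
One way to compare the two listings is sketched below. This is only an illustration under some assumptions: that both endpoints return JSON with an "items" list whose entries carry the name under "Hosts" -> "host_name", and that the server accepts HTTP basic auth. The helper names (host_names, diff_hosts, fetch) are made up for this example.

```python
import base64
import json
import urllib.request


def host_names(payload):
    """Collect host names from a hosts listing.

    Assumes the payload looks like {"items": [{"Hosts": {"host_name": ...}}]}.
    """
    return {item["Hosts"]["host_name"] for item in payload.get("items", [])}


def diff_hosts(cluster_payload, registered_payload):
    """Return (names only in the cluster, names only among registered hosts)."""
    cluster = host_names(cluster_payload)
    registered = host_names(registered_payload)
    return sorted(cluster - registered), sorted(registered - cluster)


def fetch(url, user, password):
    """GET a URL with basic auth and parse the JSON body (assumed auth scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage; fill in your own server, cluster name and credentials:
#   cluster = fetch("http://server:8080/api/v1/clusters/<cluster name>/hosts", user, pw)
#   registered = fetch("http://server:8080/api/v1/hosts", user, pw)
#   stale, unknown = diff_hosts(cluster, registered)
```

Names that show up only in the cluster listing would be the old host A entries, and names that show up only among the registered hosts would be the new host B entries that are heartbeating.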

-Sumit

On 7/15/13 8:47 AM, "Brian Jeltema" <[email protected]> wrote:

>I had to do some network reconfiguration on our cluster. After rebooting
>everything and restarting
>the ambari server and the ambari agents, the server reports (via the UI)
>that it is not receiving heartbeats.
>However, when I look at the server and agent logs, I see heartbeat
>activity:
>
>agent:
>INFO 2013-07-15 11:40:12,169 Heartbeat.py:61 - Sending heartbeat with
>response id: 251 and timestamp: 1373902812168
>INFO 2013-07-15 11:40:12,214 Controller.py:176 - No commands sent from
>the Server.
>
>server
>11:41:44,760 INFO HeartBeatHandler:108 - Received heartbeat from host,
>hostname=foo.net, currentResponseId=260, receivedResponseId=260
>11:41:44,761 INFO AgentResource:109 - Sending heartbeat response with
>response id 261
>
>(response id's don't match because I didn't try to capture them in
>unison). I suspect there may be persisted state in the postgres database
>from the previous network configuration that is causing the problem. Any
>suggestions for a fix short of a complete redeploy?
>
>TIA
>
>Brian
