[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722193#comment-13722193
 ] 

Koushik Das commented on CLOUDSTACK-2428:
-----------------------------------------

Excerpts from the attached log show that host was determined as down

2013-07-29 10:13:45,370 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Investigating why host 1 has disconnected with event 
PingTimeout
2013-07-29 10:13:45,370 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) checking if agent (1) is alive
2013-07-29 10:13:45,380 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
1-1386348590: Sending  { Cmd , MgmtId: 2546680725505, via: 1, Ver: v1, Flags: 
100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2013-07-29 10:13:45,380 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
1-1386348590: Executing:  { Cmd , MgmtId: 2546680725505, via: 1, Ver: v1, 
Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2013-07-29 10:13:50,385 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
1-1386348590: Received:  { Ans: , MgmtId: 2546680725505, via: 1, Ver: v1, 
Flags: 10, { CheckHealthAnswer } }
2013-07-29 10:13:50,385 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Details from executing class 
com.cloud.agent.api.CheckHealthCommand: resource is not alive
2013-07-29 10:13:50,389 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of the 
host.  Moving on.
2013-07-29 10:13:50,405 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
6-1235288103: Sending  { Cmd , MgmtId: 2546680725505, via: 6, Ver: v1, Flags: 
100011, 
[{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"ccec7842-0c88-4181-b558-f5180fe36918","privateNetwork":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false}},"wait":20}}]
 }
2013-07-29 10:13:50,406 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
6-1235288103: Executing:  { Cmd , MgmtId: 2546680725505, via: 6, Ver: v1, 
Flags: 100011, 
[{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"ccec7842-0c88-4181-b558-f5180fe36918","privateNetwork":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"10.147.29.63","netmask":"255.255.255.0","mac":"00:10:18:89:9a:67","isSecurityGroupEnabled":false}},"wait":20}}]
 }
2013-07-29 10:13:50,880 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:null) Seq 
6-1235288103: Received:  { Ans: , MgmtId: 2546680725505, via: 6, Ver: v1, 
Flags: 10, { CheckOnHostAnswer } }
2013-07-29 10:13:50,880 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Details from executing class 
com.cloud.agent.api.CheckOnHostCommand: Heart Beat is done
2013-07-29 10:13:50,880 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-10:null) XenServerInvestigator was able to determine host 1 is 
in Down
2013-07-29 10:13:50,880 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) The state determined is Down
2013-07-29 10:13:50,880 ERROR [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Host is down: 1-xenserver-kd1.  Starting HA on the VMs
2013-07-29 10:13:50,881 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Host 1 is disconnecting with event HostDown
2013-07-29 10:13:50,886 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) The next status of agent 1is Down, current status is Up
2013-07-29 10:13:50,886 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Deregistering link for 1 with state Down
2013-07-29 10:13:50,886 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:null) Remove Agent : 1
2013-07-29 10:13:50,886 DEBUG [c.c.a.m.DirectAgentAttache] 
(AgentTaskPool-10:null) Processing disconnect 1

                
> HA - When the master host is disconnected , the host status contines to 
> remain in "Up" state because of 
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of 
> slave 
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-2428
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2428
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Build from pvaln
>            Reporter: Sangeetha Hariharan
>            Assignee: Koushik Das
>            Priority: Critical
>             Fix For: 4.2.0
>
>         Attachments: logs_7_29, logs.rar
>
>
> 1. Advance zone  with 1 cluster with 2 hosts. Create  Shared network with 
> private vlan.
> 2. Deploy few HA enabled Vms in this network. 
> 3. pull network cable for one of the host.
> When cloudstack detects that the host is disconnected , it is not able to out 
> the host in disconnected state and start HA for Vms that are HA enabeld,
> I see the following exception in the management server logs:
> 2013-05-09 17:15:55,576 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-267:null) Seq 1-1435828229: Executing request
> 2013-05-09 17:15:55,602 DEBUG [xen.resource.XenServerConnectionPool] 
> (DirectAgent-267:null) Catch Exception: 
> com.xensource.xenapi.Types$HostOffline Host is offline 10.223.81.62 due to 
> You attempted an operation which involves a host which could not be contacted.
> 2013-05-09 17:15:55,603 DEBUG [xen.resource.XenServerConnectionPool] 
> (DirectAgent-267:null) Trying to reset master of slave 10.223.81.62 to 
> 10.223.81.61
> 2013-05-09 17:16:02,319 WARN  [xen.resource.CitrixResourceBase] 
> (DirectAgent-265:null)  can not ping xenserver 
> 520d4994-8b1f-4dda-b51d-2ee63750abf6
> 2013-05-09 17:16:02,319 WARN  [agent.manager.DirectAgentAttache] 
> (DirectAgent-265:null) Unable to get current status on 1
> 2013-05-09 17:16:02,321 INFO  [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-11:null) Investigating why host 1 has disconnected with event 
> AgentDisconnected
> 2013-05-09 17:16:02,321 DEBUG [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-11:null) checking if agent (1) is alive
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] 
> (AgentTaskPool-11:null) Seq 1-1435828294: Sending  { Cmd , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 100011, 
> [{"CheckHealthCommand":{"wait":50}}] }
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] 
> (AgentTaskPool-11:null) Seq 1-1435828294: Executing:  { Cmd , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 100011, 
> [{"CheckHealthCommand":{"wait":50}}] }
> 2013-05-09 17:16:02,323 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-271:null) Seq 1-1435828294: Executing request
> 2013-05-09 17:16:04,035 DEBUG [agent.manager.AgentAttache] 
> (AgentTaskPool-10:null) Seq 6-474349576: Waiting some more time because this 
> is the current command
> 2013-05-09 17:16:04,040 DEBUG [xen.resource.XenServerConnectionPool] 
> (DirectAgent-268:null) localLogout has problem Failed to read server's 
> response: connect timed out
> 2013-05-09 17:16:04,040 WARN  [agent.manager.DirectAgentAttache] 
> (DirectAgent-268:null) Seq 1-1435828292: Exception Caught while executing 
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of 
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: 
> Failed to read server's response: connect timed out
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at 
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,041 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-268:null) Seq 1-1435828292: Response Received:
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] 
> (DirectAgent-268:null) Seq 1-1435828292: Processing:  { Ans: , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 10, 
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
>  Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to 
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect 
> timed out","wait":0}}] }
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] 
> (AgentTaskPool-5:null) Seq 1-1435828292: Received:  { Ans: , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.AbstractInvestigatorImpl] 
> (AgentTaskPool-5:null) host (10.223.81.50) cannot be pinged, returning null 
> ('I don't know')
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.UserVmDomRInvestigator] 
> (AgentTaskPool-5:null) sending ping from (5) to agent's host ip address 
> (10.223.81.50)
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] 
> (AgentTaskPool-5:null) Seq 5-2082341067: Sending  { Cmd , MgmtId: 
> 7647994577963, via: 5, Ver: v1, Flags: 100011, 
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] }
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] 
> (AgentTaskPool-5:null) Seq 5-2082341067: Executing:  { Cmd , MgmtId: 
> 7647994577963, via: 5, Ver: v1, Flags: 100011, 
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] }
> 2013-05-09 17:16:04,043 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-272:null) Seq 5-2082341067: Executing request
> 2013-05-09 17:16:04,053 DEBUG [xen.resource.XenServerConnectionPool] 
> (DirectAgent-91:null) localLogout has problem Failed to read server's 
> response: connect timed out
> 2013-05-09 17:16:04,053 WARN  [agent.manager.DirectAgentAttache] 
> (DirectAgent-91:null) Seq 1-1435828293: Exception Caught while executing 
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of 
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: 
> Failed to read server's response: connect timed out
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at 
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,054 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-91:null) Seq 1-1435828293: Response Received:
> 2013-05-09 17:16:04,054 DEBUG [agent.transport.Request] (DirectAgent-91:null) 
> Seq 1-1435828293: Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: 
> v1, Flags: 10, 
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
>  Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to 
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect 
> timed out","wait":0}}] }
> 2013-05-09 17:16:04,055 DEBUG [agent.transport.Request] 
> (AgentTaskPool-7:null) Seq 1-1435828293: Received:  { Ans: , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.AbstractInvestigatorImpl] 
> (AgentTaskPool-7:null) host (10.223.81.52) cannot be pinged, returning null 
> ('I don't know')
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.UserVmDomRInvestigator] 
> (AgentTaskPool-7:null) sending ping from (5) to agent's host ip address 
> (10.223.81.52)
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] 
> (AgentTaskPool-7:null) Seq 5-2082341068: Sending  { Cmd , MgmtId: 
> 7647994577963, via: 5, Ver: v1, Flags: 100011, 
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] }
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.AgentAttache] 
> (AgentTaskPool-14:null) Seq 3-1752367195: Waiting some more time because this 
> is the current command
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] 
> (AgentTaskPool-7:null) Seq 5-2082341068: Executing:  { Cmd , MgmtId: 
> 7647994577963, via: 5, Ver: v1, Flags: 100011, 
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] }
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-91:null) Seq 5-2082341068: Executing request
> 2013-05-09 17:16:05,175 DEBUG [storage.secondary.SecondaryStorageManagerImpl] 
> (secstorage-1:null) Zone 1 is ready to launch secondary storage VM
> 2013-05-09 17:16:05,614 DEBUG [xen.resource.XenServerConnectionPool] 
> (DirectAgent-267:null) localLogout has problem Failed to read server's 
> response: connect timed out
> 2013-05-09 17:16:05,614 WARN  [agent.manager.DirectAgentAttache] 
> (DirectAgent-267:null) Seq 1-1435828229: Exception Caught while executing 
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of 
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: 
> Failed to read server's response: connect timed out
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at 
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:7725)
>         at 
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:570)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at 
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at 
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at 
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:05,615 DEBUG [agent.manager.DirectAgentAttache] 
> (DirectAgent-267:null) Seq 1-1435828229: Response Received:
> 2013-05-09 17:16:05,615 DEBUG [agent.transport.Request] 
> (DirectAgent-267:null) Seq 1-1435828229: Processing:  { Ans: , MgmtId: 
> 7647994577963, via: 1, Ver: v1, Flags: 10, 
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
>  Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to 
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect 
> timed out","wait":0}}] }
> 2013-05-09 17:16:05,704 DEB

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to