[
https://issues.apache.org/jira/browse/CLOUDSTACK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sangeetha Hariharan reassigned CLOUDSTACK-2428:
-----------------------------------------------
Assignee: Koushik Das (was: Sangeetha Hariharan)
Koushik,
I am able to reproduce this issue again.
Following are the steps that were done:
Set up has Advanced zone with 2 Xenserver 6.2 hosts (with latest patches up to
XS62E009) in cluster.
I had few Vms deployed in both the hosts.
Powered off slave. ( using IPMI)
Host was marked down.
Powered on slave. ( using IPMI)
Host was marked up.
No issues was seen.
Destroyed all the existing use Vms.
Created new account and deployed few more HA enabled Vms in both the hosts.
Powered off master. ( using IPMI)
Now I see the same issue reported . Following exception seen in management
server logs:
2013-12-16 17:57:39,391 WARN [c.c.a.m.DirectAgentAttache]
(DirectAgent-266:ctx-f8d08ed2) Seq 1-1825046531: Exception Caught whil
e executing command
com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
slave 10.223.59.66 to 10.223.59.67 due to org.apache.xmlrpc.XmlRpcException:
Failed to read server's response: connect timed out
at
com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
at
com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
at
com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5980)
at
com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:8243)
at
com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:588)
at
com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:59)
at
com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:106)
at
com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:216)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
2013-12-16 17:57:39,392 DEBUG [c.c.a.m.DirectAgentAttache]
(DirectAgent-266:ctx-f8d08ed2) Seq 1-1825046531: Response Received:
2013-12-16 17:57:39,392 DEBUG [c.c.a.t.Request] (DirectAgent-266:ctx-f8d08ed2)
Seq 1-1825046531: Processing: { Ans: , MgmtId: 112516401760401, via: 1, Ver:
v1, Flags: 10,
[{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
Unable to reset master of slave 10.223.59.66 to 10.223.59.67 due to
org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect
timed out","wait":0}}] }
Also seems like the xenserver has already switched the master. Cloudplatform is
still trying to log in to the host that is actually down ( previous master
current slave) and failing.
[root@Rack3Host23 ~]# xe pool-list
uuid ( RO) : e672311f-ce7e-0b94-7343-cad171f4da15
name-label ( RW):
name-description ( RW):
master ( RO): 8bc21b8a-87b5-4765-bdbe-200aefc1a98f
default-SR ( RW): f7cdbdc3-9629-227f-0c20-7df88e158998
[root@Rack3Host23 ~]# xe host-list
uuid ( RO) : eb0add58-3df6-4870-8937-ac21abe5471b
name-label ( RW): Rack3Host20.lab.vmops.com
name-description ( RW): Default install of XenServer
uuid ( RO) : 8bc21b8a-87b5-4765-bdbe-200aefc1a98f
name-label ( RW): Rack3Host23.lab.vmops.com
name-description ( RW): Default install of XenServer
Since I waited long enough this time , I see that the host was marked as "Down"
eventually (after about 5 hours)
The host was powered down at ~ 17:57 and CS marked this host as being down
after ~ 5 hours and 10 minutes.
2013-12-16 17:57:39,391 WARN [c.c.a.m.DirectAgentAttache]
(DirectAgent-266:ctx-f8d08ed2) Seq 1-182504
6531: Exception Caught while executing command
com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
slave 10.223.59.66 to 10.223.59.67 due to org.apache.xmlrpc.XmlRpcException:
Failed to read server's response: connect timed out
2013-12-16 22:07:00,639 DEBUG [c.c.n.NetworkUsageManagerImpl]
(AgentTaskPool-9:ctx-aa4cce4f) Disconnected called on 1 with status Down
Questions here are:
1. Why is there a delay of 5 hours to mark a host as “DOWN” ? Usually 3 ping
failures should have caused the Host to be marked as being in “Down” state ,
correct ?
2. What actually triggered the host to be be marked "Down" after 5 hours?
3. Is there some configurable parameter that can be tuned to a lower value to
get the host to be marked "Down" quicker?
> HA - When the master host is disconnected , the host status contines to
> remain in "Up" state because of
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
> slave
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-2428
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2428
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.2.0, 4.3.0
> Environment: Build from pvaln
> Reporter: Sangeetha Hariharan
> Assignee: Koushik Das
> Priority: Critical
> Fix For: 4.2.0, 4.3.0
>
> Attachments: hostdown.rar, hostdown.rar, logs.rar, logs_7_29,
> management-server.rar
>
>
> 1. Advance zone with 1 cluster with 2 hosts. Create Shared network with
> private vlan.
> 2. Deploy few HA enabled Vms in this network.
> 3. pull network cable for one of the host.
> When cloudstack detects that the host is disconnected , it is not able to out
> the host in disconnected state and start HA for Vms that are HA enabeld,
> I see the following exception in the management server logs:
> 2013-05-09 17:15:55,576 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-267:null) Seq 1-1435828229: Executing request
> 2013-05-09 17:15:55,602 DEBUG [xen.resource.XenServerConnectionPool]
> (DirectAgent-267:null) Catch Exception:
> com.xensource.xenapi.Types$HostOffline Host is offline 10.223.81.62 due to
> You attempted an operation which involves a host which could not be contacted.
> 2013-05-09 17:15:55,603 DEBUG [xen.resource.XenServerConnectionPool]
> (DirectAgent-267:null) Trying to reset master of slave 10.223.81.62 to
> 10.223.81.61
> 2013-05-09 17:16:02,319 WARN [xen.resource.CitrixResourceBase]
> (DirectAgent-265:null) can not ping xenserver
> 520d4994-8b1f-4dda-b51d-2ee63750abf6
> 2013-05-09 17:16:02,319 WARN [agent.manager.DirectAgentAttache]
> (DirectAgent-265:null) Unable to get current status on 1
> 2013-05-09 17:16:02,321 INFO [agent.manager.AgentManagerImpl]
> (AgentTaskPool-11:null) Investigating why host 1 has disconnected with event
> AgentDisconnected
> 2013-05-09 17:16:02,321 DEBUG [agent.manager.AgentManagerImpl]
> (AgentTaskPool-11:null) checking if agent (1) is alive
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request]
> (AgentTaskPool-11:null) Seq 1-1435828294: Sending { Cmd , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 100011,
> [{"CheckHealthCommand":{"wait":50}}] }
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request]
> (AgentTaskPool-11:null) Seq 1-1435828294: Executing: { Cmd , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 100011,
> [{"CheckHealthCommand":{"wait":50}}] }
> 2013-05-09 17:16:02,323 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-271:null) Seq 1-1435828294: Executing request
> 2013-05-09 17:16:04,035 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-10:null) Seq 6-474349576: Waiting some more time because this
> is the current command
> 2013-05-09 17:16:04,040 DEBUG [xen.resource.XenServerConnectionPool]
> (DirectAgent-268:null) localLogout has problem Failed to read server's
> response: connect timed out
> 2013-05-09 17:16:04,040 WARN [agent.manager.DirectAgentAttache]
> (DirectAgent-268:null) Seq 1-1435828292: Exception Caught while executing
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
> Failed to read server's response: connect timed out
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
> at
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
> at
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
> at
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,041 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-268:null) Seq 1-1435828292: Response Received:
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request]
> (DirectAgent-268:null) Seq 1-1435828292: Processing: { Ans: , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 10,
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect
> timed out","wait":0}}] }
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request]
> (AgentTaskPool-5:null) Seq 1-1435828292: Received: { Ans: , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-5:null) host (10.223.81.50) cannot be pinged, returning null
> ('I don't know')
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-5:null) sending ping from (5) to agent's host ip address
> (10.223.81.50)
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request]
> (AgentTaskPool-5:null) Seq 5-2082341067: Sending { Cmd , MgmtId:
> 7647994577963, via: 5, Ver: v1, Flags: 100011,
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] }
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request]
> (AgentTaskPool-5:null) Seq 5-2082341067: Executing: { Cmd , MgmtId:
> 7647994577963, via: 5, Ver: v1, Flags: 100011,
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] }
> 2013-05-09 17:16:04,043 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-272:null) Seq 5-2082341067: Executing request
> 2013-05-09 17:16:04,053 DEBUG [xen.resource.XenServerConnectionPool]
> (DirectAgent-91:null) localLogout has problem Failed to read server's
> response: connect timed out
> 2013-05-09 17:16:04,053 WARN [agent.manager.DirectAgentAttache]
> (DirectAgent-91:null) Seq 1-1435828293: Exception Caught while executing
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
> Failed to read server's response: connect timed out
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
> at
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
> at
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
> at
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,054 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-91:null) Seq 1-1435828293: Response Received:
> 2013-05-09 17:16:04,054 DEBUG [agent.transport.Request] (DirectAgent-91:null)
> Seq 1-1435828293: Processing: { Ans: , MgmtId: 7647994577963, via: 1, Ver:
> v1, Flags: 10,
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect
> timed out","wait":0}}] }
> 2013-05-09 17:16:04,055 DEBUG [agent.transport.Request]
> (AgentTaskPool-7:null) Seq 1-1435828293: Received: { Ans: , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-7:null) host (10.223.81.52) cannot be pinged, returning null
> ('I don't know')
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-7:null) sending ping from (5) to agent's host ip address
> (10.223.81.52)
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request]
> (AgentTaskPool-7:null) Seq 5-2082341068: Sending { Cmd , MgmtId:
> 7647994577963, via: 5, Ver: v1, Flags: 100011,
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] }
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.AgentAttache]
> (AgentTaskPool-14:null) Seq 3-1752367195: Waiting some more time because this
> is the current command
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request]
> (AgentTaskPool-7:null) Seq 5-2082341068: Executing: { Cmd , MgmtId:
> 7647994577963, via: 5, Ver: v1, Flags: 100011,
> [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] }
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-91:null) Seq 5-2082341068: Executing request
> 2013-05-09 17:16:05,175 DEBUG [storage.secondary.SecondaryStorageManagerImpl]
> (secstorage-1:null) Zone 1 is ready to launch secondary storage VM
> 2013-05-09 17:16:05,614 DEBUG [xen.resource.XenServerConnectionPool]
> (DirectAgent-267:null) localLogout has problem Failed to read server's
> response: connect timed out
> 2013-05-09 17:16:05,614 WARN [agent.manager.DirectAgentAttache]
> (DirectAgent-267:null) Seq 1-1435828229: Exception Caught while executing
> command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
> slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
> Failed to read server's response: connect timed out
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
> at
> com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:7725)
> at
> com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:570)
> at
> com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
> at
> com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
> at
> com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:05,615 DEBUG [agent.manager.DirectAgentAttache]
> (DirectAgent-267:null) Seq 1-1435828229: Response Received:
> 2013-05-09 17:16:05,615 DEBUG [agent.transport.Request]
> (DirectAgent-267:null) Seq 1-1435828229: Processing: { Ans: , MgmtId:
> 7647994577963, via: 1, Ver: v1, Flags: 10,
> [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to
> org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect
> timed out","wait":0}}] }
> 2013-05-09 17:16:05,704 DEB
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)