++Koushik Thanks Sowmya
Koushik Any reason why this is cannot be applied to ACS 4.1? Since I'm Release Manager for 4.1.x branch, I will work on testing and releasing 4.1.2 Thanks ilya > -----Original Message----- > From: Sowmya Krishnan [mailto:[email protected]] > Sent: Thursday, September 26, 2013 3:03 AM > To: [email protected] > Subject: RE: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS > service is restarted > > Ilya, > Is this similar to this issue: > https://issues.apache.org/jira/browse/CLOUDSTACK-4636 > Fix for this has gone in 4.2-forward and master. As stated in the bug, this > could happen if more than one thread tries to process the same host post > MS restart. > > > -----Original Message----- > > From: Musayev, Ilya [mailto:[email protected]] > > Sent: Thursday, September 26, 2013 3:43 AM > > To: [email protected] > > Subject: RE: [MAJOR][BUG] ACS powers off some VMs in vSphere - when > MS > > service is restarted > > > > Sorry if the error log is abit hard to read, as an example, please > > track the vmname i-2-262-acs-docs-fc17. > > > > 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] > > (AgentTaskPool-1:null) Detecting a new state but couldn't find a old > > state so adding it to the changes: i-2-262-acs-docs-fc17 > > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running > > and realState = Stopped > > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running > > and realState = Stopped > > 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] > > (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17 > > 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] > > (AgentTaskPool- > > 1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: > > 11, > > Ver: v1, Flags: 100101, > > [{"StopCommand":{"isProxy":false,"vmName":"i-2-262- > > acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > > 2- > > 278-demo01t-ops- > > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs- > > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-28 > > 3- > > vmbld01l-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":" > > i-2- > > 285-ossec01l-ops-08","wait":0}}] } > > 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] > > (AgentTaskPool- > > 1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, > > via: 11, > > Ver: v1, Flags: 100101, > > [{"StopCommand":{"isProxy":false,"vmName":"i-2-262- > > acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > > 2- > > 278-demo01t-ops- > > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs- > > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-28 > > 3- > > vmbld01l-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":" > > i-2- > > 285-ossec01l-ops-08","wait":0}}] } > > 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] > > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource > > StopCommand: > > {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0} > > 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent- > > 3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on > > host > > 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] > > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM > > i-2-262-acs-docs-fc17 is already in stopped state > > > > > -----Original Message----- > > > From: Musayev, Ilya [mailto:[email protected]] > > > Sent: Wednesday, September 25, 2013 6:08 PM > > > To: [email protected] > > > Subject: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS > > > service is restarted > > > > > > Hi All, > > > > > > I'd like to raise an awareness on the issue with ACS 4.1.1 (I assume > > > other versions are affected - since I could not find any changes in > > > the latest code that would state otherwise). > > > > > > On MS server start/restart, it checks all the VMs for their state, > > > if some reason state is either not found or comes as stopped - which > > > is inaccurate, the vms will *power off*. > > > > > > On todays occurrence, half of my plant went down because ACS > > > invoked StopCommand on the vms that either had no state or for some > > > reason agentState == Stopped. > > > > > > The details of this issue is in > > > https://issues.apache.org/jira/browse/CLOUDSTACK-4740 > > > > > > The error in the logs is: > > > > > > 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] > > > (AgentTaskPool-1:null) Detecting a new state but couldn't find a old > > > state so adding it to the changes: i-2-262-acs-docs-fc17 > > > 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] > > > (AgentTaskPool- > > > 1:null) Seq -1--1: Startup request from directly connected host: { > > > Cmd , > > > MgmtId: -1, via: -1, Ver: v1, Flags: 11, > > > > [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"po > > > olSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15- > > > CLOUD41-OLD":{"state":"Stopped"},"v-260- > > > CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops- > > > 08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1 > > > - > > > CLOUD41-OLD":{"state":"Running"},"i-27-280- > > > CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops- > > > 08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"}, > > > "i > > > -24- > > > 265-test3":{"state":"Running"},"cloud01l-ops- > > > 08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops- > > > 08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66- > > > CLOUD411":{"state":"Running"},"i-2-281-acs- > > > > appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware"," > > > hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":" > > > Dc > > > Gl > > > obal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing", > > > "d > > > ataC > > > enter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc0 > > > 0q > > > - > > > ops-08.portal.webmd.com","name":"vmha62d-ops- > > > 08.portal.webmd.com","version":"4.1.1- > > > > SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68 > > > :b > > > 5 > > > > :99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25. > > > 243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5: > > > 99 > > > :7 > > > 3:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4- > > > 45 > > > 69- > > > ac51-adc5af770bf6","host":"vmha62d-ops- > > > 08.portal.webmd.com","localPath":"","hostPath":"datastore- > > > > 19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes": > > > 14 > > > > 0383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"S > > > to > > > r > > > age","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4- > > > 45 > > > 69- > > > ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51- > > > adc5af770bf6","wait":0}] } > > > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > > > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running > > > and realState = Stopped > > > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > > > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running > > > and realState = Stopped > > > 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] > > > (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17 > > > 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] > > > (AgentTaskPool- > > > 1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: > > > 11, > > > Ver: v1, Flags: 100101, > > > [{"StopCommand":{"isProxy":false,"vmName":"i-2- > > > 262-acs-docs- > > > fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278- > > > demo01t-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName": > > > "i > > > - > > > 2-281-acs- > > > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2- > > > 28 > > > 3- > > > vmbld01l-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":" > > > i- > > > 2-285-ossec01l-ops-08","wait":0}}] } > > > 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] > > > (AgentTaskPool- > > > 1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, > > > via: 11, > > > Ver: v1, Flags: 100101, > > > [{"StopCommand":{"isProxy":false,"vmName":"i-2- > > > 262-acs-docs- > > > fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278- > > > demo01t-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName": > > > "i > > > - > > > 2-281-acs- > > > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2- > > > 28 > > > 3- > > > vmbld01l-ops- > 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":" > > > i- > > > 2-285-ossec01l-ops-08","wait":0}}] } > > > 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] > > > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource > > > StopCommand: > > > {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0} > > > 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent- > > > 3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on > > > host > > > 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] > > > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs- > docs- > > > fc17 is already in stopped state > > > 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] > > > (DirectAgent- > > > 3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: > > > 345049078181, > > > via: 11, > > > Ver: v1, Flags: 100, > > > [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM > > > i-2-262-acs-docs-fc17 is already in stopped > > > state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":" > > > Stop > > > VM i-2-278-demo01t-ops-08 > > > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details" > > > :"Stop > > > VM i-2-281-acs-appliance > > > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details" > > > :"Stop > > > VM i-2-283-vmbld01l-ops-08 > > > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details" > > > :"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] } > > > 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent- > > > 20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on > > > host > > > > > > I wonder why this was not noted earlier. > > > > > > The code in question is below, and comes from > > > > engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java > > > on master branch. > > > > > > 2365 if (serverState == State.Running) { > > > 2366 // Our records showed that it should be running so let's > > > restart > > > 2367 // it. > > > 2368 _haMgr.scheduleRestart(vm, false); > > > 2369 } else if (serverState == State.Stopping) { > > > 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop); > > > 2371 s_logger.debug("Scheduling a check stop for VM in stopping mode: > > > " + vm); > > > 2372 } else if (serverState == State.Starting) { > > > 2373 s_logger.debug("Ignoring VM in starting mode: " + > > > vm.getInstanceName()); > > > 2374 _haMgr.scheduleRestart(vm, false); > > > 2375 } > > > 2376 command = cleanup(vm); > > > > > > If there is a quick fix, which I think for now, I can just comment > > > out the "command= cleanup(vm)", please let me know. I'm not certain > > > what would be the repercussions. > > > > > > As always any feedback, is appreciated. > > > > > > Thanks > > > ilya > > > > > > > > >
