[
https://issues.apache.org/jira/browse/CLOUDSTACK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073335#comment-14073335
]
Tomasz Zieba commented on CLOUDSTACK-6036:
------------------------------------------
Hello,
After a little investigation, we have some facts. Error stopping the machine is
always associated with an entry in the logs of the content:
2014-07-24 17:00:34,918 DEBUG [vm.dao.VMInstanceDaoImpl] (HA-Worker-2:work-15)
Unable to update VM[User|Win-HD-tune]: DB Data={Host=null; State=Stopped;
updated=55; time=Thu Jul 24 17:00:34 CEST 2014} New Data: {Host=2;
State=Running; updated=55; time=Thu Jul 24 17:00:34 CEST 2014} Stale Data:
{Host=2; State=Stopping; updated=54; time=Thu Jul 24 17:00:13 CEST 2014}
and then:
2014-07-24 17:00:34,919 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-2:work-15) Stop was unsuccessful. Rescheduling
2014-07-24 17:00:34,919 INFO [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-2:work-15) Rescheduling HAWork[15-Stop-10-Stopping-Scheduled] to try
again at Thu Jul 24 17:10:48 CEST 2014
After analyzing the logs of the database:
633 Query UPDATE vm_instance SET vm_instance.update_time='2014-07-24
15:00:34', vm_instance.update_count=vm_instance.update_count+1,
vm_instance.last_host_id=2, vm_instance.state='Stopped',
vm_instance.host_id=null, vm_instance.pod_id=1 WHERE vm_instance.id = 10 AND
vm_instance.state = 'Stopping' AND vm_instance.host_id = 2 AND
vm_instance.update_count = 54
633 Query UPDATE vm_instance SET vm_instance.update_time='2014-07-24
15:00:34', vm_instance.update_count=vm_instance.update_count+1,
vm_instance.state='Running', vm_instance.host_id=2, vm_instance.pod_id=1 WHERE
vm_instance.id = 10 AND vm_instance.state = 'Stopping' AND
vm_instance.host_id = 2 AND vm_instance.update_count = 54
As can be seen ACS tries to update the entry twice but twice looking
vm_instance.update_count field with a value of 54
So the first update causes the counter vm_instance.update_count increased to 55
so once the second SQL Update command executed incorrectly.
My guess is that the situation is related to a race condition in the HA-Worker
threads, which by default is 5
In the test environment, we try to change ha.workers value to 1 and we will see
if the situation repeats itself.
> CloudStack stops the machine for no reason
> -------------------------------------------
>
> Key: CLOUDSTACK-6036
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6036
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.2.1
> Environment: ACS 4.2.1 after upgrade from 3.0.2
> ACS 4.2.1 clean install
> XCP 1.1
> Reporter: Tomasz Zieba
> Assignee: Koushik Das
> Priority: Critical
> Labels: 4.5ReviewNeeded
> Fix For: 4.4.0
>
> Attachments: management-server.log.2014-02-19.gz,
> management-server.log.2014-02-20.gz, management-server.log.2014-02-24.txt
>
>
> After the upgrade from version 3.0.2 to 4.2.1 appeared very strange error
> associated with self-stopping machine after changing the offering.
> Steps to reproduce:
> 1. Running instance of the machine
> 2. Stop with the operating system
> 3. Change offering of machine
> 4. Start the machine
> 5. Waiting for 10 minutes
> 6. CloudStack stops the machine for no reason
> 7. Restart the machine
> In the logs you can find information:
> 2014-02-05 06:27:00,974 DEBUG [xen.resource.CitrixResourceBase]
> (DirectAgent-316:null) 11. The VM i-41-824-VM is in Running state
> 2014-02-05 06:27:00,974 DEBUG [agent.transport.Request]
> (DirectAgent-316:null) Seq 50-1756626952: Processing: { Ans: , MgmtId:
> 160544475005497, via: 50, Ver: v1, Flags: 10,
> [{"com.cloud.agent.api.ClusterSyncAnswer":{"_clusterId":2,"_newStates":{"i-41-824-VM":{"t":"f32a4fee-b64e-4868-9c52-2a27e32d4c0e","u":"Running","v":"viridian:true;acpi:true;apic:true;pae:true;nx:false;timeoffset:0;cores-per-socket:4"}},"_isExecuted":false,"result":true,"wait":0}}]
> }
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState =
> Running
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState =
> Running
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request]
> (HA-Worker-1:work-1511) Seq 51-1374970375: Sending { Cmd , MgmtId:
> 160544475005497, via: 51, Ver: v1, Flags: 100111,
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
> }
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request]
> (HA-Worker-1:work-1511) Seq 51-1374970375: Executing: { Cmd , MgmtId:
> 160544475005497, via: 51, Ver: v1, Flags: 100111,
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
> }
> 2014-02-05 06:36:01,383 DEBUG [xen.resource.CitrixResourceBase]
> (DirectAgent-150:null) 9. The VM i-41-824-VM is in Stopping state
> 2014-02-05 06:36:27,625 DEBUG [xen.resource.CitrixResourceBase]
> (DirectAgent-150:null) 10. The VM i-41-824-VM is in Stopped state
> You will notice that the stop of the machine corresponds to the component
> HA-Worker.
> Such operation after the upgrade is complicating work of users.
--
This message was sent by Atlassian JIRA
(v6.2#6252)