Github user resmo commented on the pull request:
https://github.com/apache/cloudstack/pull/829#issuecomment-142882541
@anshul1886 @koushik-das
@DaanHoogland and I had a debug session last Friday, and since he is off
for the next couple of days I can give you more details about what we analysed.
The powerReportMissing is not the problem; it is only the trigger. The
grace period is the problem. The calculation of this period relies (see
https://github.com/apache/cloudstack/blob/4.5.2/engine/orchestration/src/com/cloud/vm/VirtualMachinePowerStateSyncImpl.java#L114)
on the `update_time` field in the `vm_instance` table. But when I look at the
value, it does not seem to get updated, so the grace period has most likely
always passed already.
As a workaround, I ran an SQL update every 5 seconds that set `update_time`
to the current time for my router r-342 while I was migrating it around the
ESX cluster nodes:
~~~
mysql -e 'update cloud.vm_instance set update_time=NOW() where id=342;'
~~~
And the router didn't get rebooted:
~~~
2015-09-24 11:47:07,685 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-218:ctx-5849bd19) VM state report. host: 25, vm id: 342,
power state: PowerOn
2015-09-24 11:47:07,696 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-218:ctx-5849bd19) VM state report is updated. host: 25, vm
id: 342, power state: PowerOn
2015-09-24 11:48:06,462 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-55:ctx-84cd4323) VM state report. host: 19, vm id: 342,
power state: PowerOn
2015-09-24 11:48:06,471 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-55:ctx-84cd4323) VM state report is updated. host: 19, vm
id: 342, power state: PowerOn
2015-09-24 11:48:06,493 WARN [o.a.c.alerts]
(DirectAgentCronJob-55:ctx-84cd4323) alertType:: 9 // dataCenterId:: 1 //
podId:: 1 // clusterId:: null // message:: Router has been migrated out of
band: r-342-VM
2015-09-24 11:49:06,539 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-29:ctx-2a57d676) Detected missing VM. host: 19, vm id: 342,
power state: PowerReportMissing, last state update: 1443095344000
2015-09-24 11:49:06,539 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-29:ctx-2a57d676) vm id: 342 - time since last state
update(-7197461ms) has not passed graceful period yet
2015-09-24 11:49:07,719 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgentCronJob-444:ctx-fdd4c055) VM state report. host: 20, vm id: 342,
power state: PowerOn
~~~
This means the patch does not fix the root cause. To me the root cause is
that `update_time` is not updated or the gracePeriod calculation is wrong.
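
To make the check concrete, here is a minimal Java sketch of a grace-period
comparison. It is not the CloudStack code: the class name, the method names and
the 10-minute grace period are assumptions made up for illustration. The two
timestamps are taken from the log above: the reported last state update
(1443095344000) and the "now" implied by the logged elapsed time of -7197461ms.

```java
public class GracePeriodSketch {

    // Milliseconds elapsed between the stored update_time and "now".
    static long elapsedMs(long updateTimeMs, long nowMs) {
        return nowMs - updateTimeMs;
    }

    // True once more than gracePeriodMs has elapsed since the last update.
    static boolean gracePeriodPassed(long updateTimeMs, long nowMs, long gracePeriodMs) {
        return elapsedMs(updateTimeMs, nowMs) > gracePeriodMs;
    }

    public static void main(String[] args) {
        long gracePeriodMs = 10 * 60 * 1000L;  // hypothetical value, not from the source
        long lastUpdateMs = 1443095344000L;    // "last state update" from the log
        long nowMs = lastUpdateMs - 7197461L;  // implied by the logged elapsed time

        // update_time refreshed by the workaround: elapsed time is negative.
        System.out.println(elapsedMs(lastUpdateMs, nowMs));
        System.out.println(gracePeriodPassed(lastUpdateMs, nowMs, gracePeriodMs));

        // update_time never refreshed (here: one hour old, a made-up value).
        long staleUpdateMs = nowMs - 60 * 60 * 1000L;
        System.out.println(gracePeriodPassed(staleUpdateMs, nowMs, gracePeriodMs));
    }
}
```

With an `update_time` that lies ahead of the clock used for "now", the elapsed
time is negative and the grace period never passes; with a stale `update_time`
it has always passed.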
Any thoughts?