On 03/20/2014 12:29 PM, Chris Friesen wrote:
The fact that there are no success or error logs in nova-compute.log
makes me wonder if we somehow got stuck in self.driver.reboot().
Also, I'm kind of wondering what would happen if nova-compute was
running reboot_instance() and we rebooted the controller at the same
time. reboot_instance() could time out trying to update the instance
with the new power state and a task_state of None. Later on in
_sync_power_states() we would update the power_state, but nothing would
update the task_state. I don't think this is what happened to us though
since I'd expect to see logs of the timeout.
Actually, looking at the logs a bit more carefully it appears that what
happened is something like this:
1. We reboot the controllers.
2. Right after they come back up, something calls compute.api.API.reboot().
3. That sets instance.task_state = task_states.REBOOTING and then calls
   instance.save() to update the database.
4. Then it calls self.compute_rpcapi.reboot_instance(), which does an RPC cast.
5. That message gets dropped on the floor due to communication issues
   between the controller and the compute node.
6. Now we're stuck with a task_state of REBOOTING.
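
A toy model of that sequence, in plain Python rather than real nova code,
shows why nothing ever clears the state. The Instance class, the dict
standing in for the database, and the lossy cast() helper are all made up
for illustration; only the task_state and REBOOTING names come from the
description above.

```python
# Toy model of the sequence above; this is NOT nova code.
REBOOTING = "rebooting"


class Instance:
    """Hypothetical stand-in for an instance record."""

    def __init__(self, uuid, db):
        self.uuid = uuid
        self.task_state = None
        self._db = db

    def save(self):
        # Persist task_state to the (fake) database.
        self._db[self.uuid] = self.task_state


def cast(queue, message, link_up):
    """Fire-and-forget send: the caller gets no reply and no error."""
    if link_up:
        queue.append(message)
    # If the link is down the message is silently lost.


def api_reboot(instance, queue, link_up):
    # The state is committed before the message is sent.
    instance.task_state = REBOOTING
    instance.save()
    cast(queue, ("reboot_instance", instance.uuid), link_up)


db = {}
compute_queue = []
inst = Instance("inst-1", db)
api_reboot(inst, compute_queue, link_up=False)

# The database says REBOOTING, but the compute side never heard about it.
print(db["inst-1"], compute_queue)
```

Because the cast returns nothing, the caller in this model has no way to
notice the loss and roll task_state back.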
I think that both of the RPC message loss scenarios are valid with
current nova code, so we really do need an audit to clean up after this
sort of thing.
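
As a rough sketch of what such an audit might look like (again not nova
code): periodically look for instances that have sat in REBOOTING for
longer than some timeout and reset them. The dict-based records, the
updated_at field, the function names, and the 10 minute timeout are all
assumptions for illustration.

```python
import datetime

REBOOTING = "rebooting"
REBOOT_TIMEOUT = datetime.timedelta(minutes=10)  # assumed value


def find_stuck_reboots(instances, now, timeout=REBOOT_TIMEOUT):
    """Return records that have been in REBOOTING longer than timeout."""
    return [
        inst for inst in instances
        if inst["task_state"] == REBOOTING
        and now - inst["updated_at"] > timeout
    ]


def audit(instances, now, timeout=REBOOT_TIMEOUT):
    """Clear task_state on stuck records and return their uuids."""
    stuck = find_stuck_reboots(instances, now, timeout)
    for inst in stuck:
        inst["task_state"] = None
        inst["updated_at"] = now
    return [inst["uuid"] for inst in stuck]


now = datetime.datetime(2014, 3, 20, 13, 0)
instances = [
    # Stuck: REBOOTING for 30 minutes.
    {"uuid": "a", "task_state": REBOOTING,
     "updated_at": now - datetime.timedelta(minutes=30)},
    # Probably still in progress: REBOOTING for 1 minute.
    {"uuid": "b", "task_state": REBOOTING,
     "updated_at": now - datetime.timedelta(minutes=1)},
    # Idle instance, nothing to do.
    {"uuid": "c", "task_state": None,
     "updated_at": now - datetime.timedelta(days=1)},
]
cleaned = audit(instances, now)
print(cleaned)
```

Whether the right cleanup is to clear the task_state or to retry the
reboot is a policy question this sketch does not try to answer.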
Chris