On 03/29/2015 09:26 PM, Mike Dorman wrote:
Hi all,
I’m curious about how people deal with failures of compute nodes, as in total
failure when the box is gone for good. (I mainly care about KVM hypervisors, but
I’m interested in more general cases as well.)
The particular situation we’re looking at: how end users could identify or be
notified of VMs that no longer exist, because their hypervisor is dead. As I
understand it, Nova will still believe VMs are running, and really has no way to
know anything has changed (other than that the nova-compute service has dropped
off).
I understand failure detection is a tricky thing. But it seems like there must
be something a little better than this.
This is a timely question...I was wondering if it might make sense to upstream
one of the changes we've made locally.
We have an external entity monitoring the health of compute nodes. When one of
them goes down we automatically take action regarding the instances that had
been running on it.
Normally nova won't let you evacuate an instance until the compute node is
detected as "down", but that typically takes 60 seconds, and our software knows
the compute node is gone within a few seconds.
The change we made was to patch nova to allow the health monitor to explicitly
tell nova that the node is to be considered "down" (so that instances can be
evacuated without delay). When the external monitoring entity detects that the
compute node is back, it tells nova the node may be considered "up" (if nova
agrees that it's "up").
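A rough sketch of the behaviour described above, in Python. This is
illustrative only and is not the actual nova patch: the class, its method
names, and the SERVICE_DOWN_TIME constant are all hypothetical, and the
60-second value is just the typical delay mentioned earlier.

```python
import time

# Hypothetical constant: the typical ~60 s before a silent node counts as "down".
SERVICE_DOWN_TIME = 60.0


class ComputeNodeState:
    """Toy model of one compute node's liveness (hypothetical, not nova code)."""

    def __init__(self, now=time.monotonic):
        self._now = now                  # injectable clock, handy for testing
        self.last_heartbeat = now()      # time of the last report from the node
        self.forced_down = False         # set by the external health monitor

    def heartbeat(self):
        """Record a periodic report from the compute service."""
        self.last_heartbeat = self._now()

    def heartbeat_is_fresh(self):
        """Heartbeat-only view: the last report is recent enough."""
        return self._now() - self.last_heartbeat <= SERVICE_DOWN_TIME

    def force_down(self):
        """External monitor says the node is gone; do not wait for the timeout."""
        self.forced_down = True

    def clear_forced_down(self):
        """External monitor says the node is back; this only lifts the override."""
        self.forced_down = False

    def is_up(self):
        # "Up" needs both: no forced-down override AND a fresh heartbeat,
        # i.e. the node is only "up" if the heartbeat check agrees.
        return not self.forced_down and self.heartbeat_is_fresh()

    def can_evacuate(self):
        """Evacuation is allowed as soon as the node is considered down."""
        return not self.is_up()
```

With this model, calling force_down() a few seconds after a failure makes
can_evacuate() return True immediately, rather than after the timeout expires.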
Is this ability to tell nova that a compute node is "down" something that would
be of interest to others?
Chris