Re: [Openstack-operators] What to do when a compute node dies?

Jay Pipes Mon, 30 Mar 2015 13:51:30 -0700

On 03/30/2015 10:42 AM, Chris Friesen wrote:

On 03/29/2015 09:26 PM, Mike Dorman wrote:

Hi all,


I’m curious about how people deal with failures of compute nodes,
as in total failure when the box is gone for good.  (Mainly care
about KVM HV, but also interested in more general cases as well.)

The particular situation we’re looking at: how end users could
identify or be notified of VMs that no longer exist, because their
hypervisor is dead.  As I understand it, Nova will still believe
VMs are running, and really has no way to know anything has changed
(other than the nova-compute instance has dropped off.)

I understand failure detection is a tricky thing.  But it seems
like there must be something a little better than this.


This is a timely question...I was wondering if it might make sense to
upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes.
When one of them goes down we automatically take action regarding the
instances that had been running on it.

Normally nova won't let you evacuate an instance until the compute
node is detected as "down", but that takes 60 sec typically and our
software knows the compute node is gone within a few seconds.
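
For readers wondering where the roughly 60 second delay comes from: to my understanding it is a heartbeat staleness check, governed by nova's `service_down_time` option (60 seconds by default). The sketch below only illustrates that idea. The function and variable names are hypothetical and this is not nova's actual code.

```shell
#!/bin/sh
# Hypothetical sketch of a heartbeat staleness check; not nova's actual code.
# Assumption: a service counts as "down" once its last heartbeat is more than
# SERVICE_DOWN_TIME seconds old (60 mirrors the delay mentioned above).
SERVICE_DOWN_TIME="${SERVICE_DOWN_TIME:-60}"

# service_is_down LAST_HEARTBEAT_EPOCH NOW_EPOCH
# Returns 0 (true) when the heartbeat is stale, 1 otherwise.
service_is_down() {
    last_heartbeat="$1"
    now="$2"
    elapsed=$((now - last_heartbeat))
    [ "$elapsed" -gt "$SERVICE_DOWN_TIME" ]
}
```

For example, `service_is_down "$last_seen" "$(date +%s)" && echo down` would report a host only after its heartbeat has gone stale, which is why an external monitor with its own failure signal can react sooner.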

Any external monitoring solution that detects the compute node is "down" could issue a call to `nova evacuate $HOST`.

The question I have for you is what does your software consider as a "downed" node? Is it some heartbeat-type stuff in network connectivity? A watchdog in KVM? Some proactive monitoring of disk or memory faults? Some combination? Something entirely different? :)

The change we made was to patch nova to allow the health monitor to
explicitly tell nova that the node is to be considered "down" (so
that instances can be evacuated without delay).

Why was it necessary to modify Nova for this? The external monitoring script could easily do: `nova service-disable $HOST nova-compute` and that immediately takes the compute node out of service and enables evacuation.
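
A minimal sketch of what such a monitoring hook might look like, assuming the nova CLI is installed and admin credentials are already exported in the environment. The function names and the `DRY_RUN` switch are hypothetical. Whether disabling the service is enough for nova to permit evacuation is the claim made above, not something this sketch verifies.

```shell
#!/bin/sh
# Hypothetical monitoring hook: take a failed compute host out of service.
# Assumptions: the nova CLI is installed and admin credentials are already
# exported in the environment. Set DRY_RUN=1 to print the command instead
# of running it.
NOVA="${NOVA:-nova}"

# run CMD ARGS...: execute the command, or just print it in dry-run mode.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fence_compute_host HOST: disable nova-compute on HOST.
fence_compute_host() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: fence_compute_host HOST" >&2
        return 2
    fi
    run "$NOVA" service-disable "$host" nova-compute
}
```

With `DRY_RUN=1` set, `fence_compute_host compute-01` prints `nova service-disable compute-01 nova-compute` without touching the cloud, which makes it easy to check the hook before wiring it into a real monitor.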


When the external monitoring entity detects that the compute node is
back, it tells nova the node may be considered "up" (if nova agrees
that it's "up").


You mean `nova service-enable $HOST nova-compute`?

Is this ability to tell nova that a compute node is "down" something
that would be of interest to others?

Unless I'm mistaken, `nova service-disable $HOST nova-compute` already exists and does this?


Best,
-jay

