Re: [Openstack-operators] What to do when a compute node dies?

Chris Friesen Mon, 30 Mar 2015 15:45:49 -0700

On 03/30/2015 02:47 PM, Jay Pipes wrote:

On 03/30/2015 10:42 AM, Chris Friesen wrote:

On 03/29/2015 09:26 PM, Mike Dorman wrote:

Hi all,


I’m curious about how people deal with failures of compute nodes,
as in total failure when the box is gone for good.  (Mainly care
about KVM HV, but also interested in more general cases as well.)

The particular situation we’re looking at: how end users could
identify or be notified of VMs that no longer exist, because their
hypervisor is dead.  As I understand it, Nova will still believe
VMs are running, and really has no way to know anything has changed
(other than the nova-compute instance has dropped off.)

I understand failure detection is a tricky thing.  But it seems
like there must be something a little better than this.


This is a timely question...I was wondering if it might make sense to
upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes.
When one of them goes down we automatically take action regarding the
instances that had been running on it.

Normally nova won't let you evacuate an instance until the compute
node is detected as "down", but that takes 60 sec typically and our
software knows the compute node is gone within a few seconds.
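
To illustrate where that delay comes from, here is a minimal sketch of a timeout-based liveness check. It is a simplified model, not nova's actual code: the function name, the `SERVICE_DOWN_TIME` constant and the 60-second value (taken from the figure quoted above) are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumed timeout, matching the "60 sec typically" mentioned above.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Treat the service as up while its last heartbeat is recent enough."""
    return (now - last_heartbeat) <= down_time


now = datetime(2015, 3, 30, 12, 0, 0)

# Node died 5 seconds ago: a pure timeout check still reports it as up.
print(service_is_up(now - timedelta(seconds=5), now))

# Only after the timeout has elapsed does the check report it as down.
print(service_is_up(now - timedelta(seconds=61), now))
```

With a check like this, a monitor that knows within seconds that the node is gone still has to wait out the remainder of the timeout before evacuation is permitted.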


Any external monitoring solution that detects the compute node is "down" could
issue a call to `nova evacuate $HOST`.

The question I have for you is what does your software consider as a "downed"
node? Is it some heartbeat-type stuff in network connectivity? A watchdog in
KVM? Some proactive monitoring of disk or memory faults? Some combination?
Something entirely different? :)

Combination of the above. A local entity monitors "critical stuff" on the
compute node, and heartbeats with a control node via one or more network links.

The change we made was to patch nova to allow the health monitor to
explicitly tell nova that the node is to be considered "down" (so
that instances can be evacuated without delay).


Why was it necessary to modify Nova for this? The external monitoring script
could easily do: `nova service-disable $HOST nova-compute` and that immediately
takes the compute node out of service and enables evacuation.

Disabling the service is not sufficient. compute.api.API.evacuate() throws an
exception if servicegroup.api.API.service_is_up(service) is true.
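
A rough sketch of why disabling does not help, assuming the "disabled" status and the liveness state are tracked independently. This is a toy model, not nova's code: the `Service` class, the `heartbeat_age` field, the `ServiceStillUp` exception and the 60-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

DOWN_TIME = 60.0  # assumed liveness timeout in seconds


class ServiceStillUp(Exception):
    """Hypothetical error raised when evacuation is refused."""


@dataclass
class Service:
    host: str
    disabled: bool = False      # admin status, what service-disable toggles
    heartbeat_age: float = 0.0  # seconds since the last heartbeat


def service_is_up(service):
    # Liveness depends only on the heartbeat, not on the disabled flag.
    return service.heartbeat_age <= DOWN_TIME


def evacuate(service):
    # Evacuation is refused while the service is still considered up.
    if service_is_up(service):
        raise ServiceStillUp(f"{service.host} is still considered up")
    return f"evacuating instances from {service.host}"


svc = Service("compute-1", disabled=True, heartbeat_age=5.0)
try:
    evacuate(svc)
except ServiceStillUp as exc:
    print("refused:", exc)
```

In this model the disabled flag only affects scheduling decisions; the evacuate path never consults it, so the call is refused until the heartbeat timeout expires.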

When the external monitoring entity detects that the compute node is back,
it tells nova the node may be considered "up" (if nova agrees that it's
"up").


You mean `nova service-disable $HOST nova-compute`?

Is this ability to tell nova that a compute node is "down" something
that would be of interest to others?


Unless I'm mistaken, `nova service-disable $HOST nova-compute` already exists
that does this?

No, what we have is basically a way to cause
servicegroup.api.API.service_is_up() to return false. That causes the correct
status to be displayed in the "State" column in the output of
"nova service-list" and allows evacuation to proceed.
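
Conceptually, the override can be sketched as follows. This is not the actual patch: the `forced_down` flag, the `Service` class, the `heartbeat_age` field and the 60-second timeout are hypothetical names and values used only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Service:
    host: str
    heartbeat_age: float       # seconds since the last heartbeat
    forced_down: bool = False  # hypothetical flag set by the external monitor


def service_is_up(service, down_time=60.0):
    # An explicit "down" from the monitor overrides the heartbeat check.
    if service.forced_down:
        return False
    # Otherwise fall back to the normal timeout-based check.
    return service.heartbeat_age <= down_time


def state(service):
    """Value that would appear in a 'State' column."""
    return "up" if service_is_up(service) else "down"


svc = Service("compute-1", heartbeat_age=3.0)
print(state(svc))        # heartbeat is recent, no override

svc.forced_down = True   # monitor reports the node as dead
print(state(svc))

svc.forced_down = False  # monitor reports the node as back
svc.heartbeat_age = 120.0
print(state(svc))        # heartbeat check still has the final say
```

Clearing the flag does not force the node "up"; it only hands the decision back to the heartbeat check, which corresponds to the "if nova agrees" condition described earlier.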


Chris


_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
