23.02.2012 10:43, Ante Karamatic wrote:
> On 23.02.2012 07:57, Vladislav Bogdanov wrote:
>
>> Thanks for the clarification, that wasn't clear at the moment I looked at
>> it. If I had known that, I wouldn't have written that RA. One remark: my RA
>> can check service aliveness in the monitor operation and repair
>> the service if it hangs.
>
> Well... Upstart actually does notice if the job failed and respawns it,
> depending on the job's configuration. Monitoring a cluster resource, in this
> case, should just return 'running' or 'not running'. It's up to the lrmd
> to restart the resource if it's not running. Restarting the resource
> within 'monitor' doesn't look like the best way to do it? It somehow
> doesn't fit the 'monitor' function, and you lose some of the
> functionality when you don't report the problem to the lrmd (allowed
> number of restarts, what to do if monitor fails, etc.).
Well, a monitor failure will cause all dependent resources to be restarted by pacemaker, which is not always desired. Since some resources (like libvirtd, iscsid or ietd) can be restarted without affecting functionality at all, I prefer them to be restarted automatically by upstart, not by pacemaker. That's why I use 'respawn' there. Of course, not all resources support that.

What I said above is not about a NOT_RUNNING failure, but about a hang. Imagine a daemon which still has a process but no longer answers requests. Upstart does not notice that. In the case of libvirtd, however, it will be noticed by the VirtualDomain RA and will cause a monitor failure with ERR_GENERIC (if I recall correctly). The VM will then be scheduled for a restart, fail on stop because libvirtd still doesn't answer, and the node will be fenced.

I was hit by this once, and it was a simple growth problem: libvirtd has a limit on the number of connections. The more resources (VMs) you have, the bigger the chance that monitor operations consume all connection slots. And I think that having libvirtd killed with -9 by its RA during monitor (and respawned by upstart) is a much lesser evil than having the whole cluster forcibly restarted (a rough sketch of that check is below). Yes, this is a hack, but it works and allows me to sleep. Of course it does not replace the need for a proper configuration; it is just one more safety layer...

>> I use it for libvirtd, which sometimes becomes
>> unresponsive, so I need to restart it before all other libvirt-related
>> resources begin to fail. Fortunately, modern libvirtd can be restarted
>> without affecting guests. Of course, that is just a hack, and it
>> should be fixed in libvirtd, but we live in the real world...
>
> You can prevent other resources from restarting by adjusting
> constraints. But this really depends on your setup. For some time now,
> a running libvirtd has not been a requirement for a running VM. I don't
> recall VMs ever failing when libvirtd restarted.

I know. But libvirtd is required to start/stop a libvirt-managed VM. That's why one needs constraints to colocate a VM with a libvirtd instance (example constraints below). It is currently impossible to specify that something is needed to start/stop a resource but not while it runs (btw, in the case of libvirtd it *is* needed to obtain resource status). So the constraints must be there. But then, if pacemaker notices that a resource (libvirtd) is not running, it will stop all dependent resources (VMs) and then restart the failed one. And it will fail to stop those resources (because libvirtd is still not running), and the node will be fenced.

Vladislav
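
For the curious, here is a very rough sketch of the monitor-side check I mean (not the actual RA code, just the idea). The function name, the 10-second timeout and the use of 'virsh version' as a liveness probe are only illustrative, and $OCF_SUCCESS / $OCF_NOT_RUNNING are assumed to come from the usual ocf-shellfuncs include:

monitor_libvirtd_alive() {
    # Job not running at all -> just report it; lrmd/upstart decide what to do.
    status libvirtd 2>/dev/null | grep -q 'start/running' || return $OCF_NOT_RUNNING

    # Process exists, but does it still answer? Probe it with a cheap
    # request and a timeout; if it hangs, treat the daemon as wedged.
    if ! timeout 10 virsh -c qemu:///system version >/dev/null 2>&1; then
        # Kill the hung daemon with -9; the upstart job has 'respawn',
        # so it comes back immediately and, from pacemaker's point of
        # view, the resource never stopped.
        pid=$(status libvirtd | sed -n 's/.*process \([0-9]*\)/\1/p')
        [ -n "$pid" ] && kill -9 "$pid"
    fi
    return $OCF_SUCCESS
}

The important part is that it still returns $OCF_SUCCESS after the kill, so pacemaker never sees a failure and the dependent VMs are left alone; upstart's 'respawn' brings libvirtd back on its own.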
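
And to illustrate the constraints I mean, a minimal sketch in crm shell syntax (lines you could paste at the 'crm configure' prompt). The resource names cl-libvirtd and vm-foo, and the primitive definitions shown in the comments, are invented for the example; substitute whatever you use to manage libvirtd if the native upstart class is not available:

# Assumed to exist already (sketch only):
#   primitive p-libvirtd upstart:libvirtd op monitor interval="30s"
#   clone cl-libvirtd p-libvirtd
#   primitive vm-foo ocf:heartbeat:VirtualDomain \
#     params config="/etc/libvirt/qemu/foo.xml"
# The VM may only run where a libvirtd clone instance runs,
# and libvirtd must be started before the VM:
colocation vm-foo-with-libvirtd inf: vm-foo cl-libvirtd
order libvirtd-before-vm-foo inf: cl-libvirtd vm-foo

These are exactly the constraints that drag the VMs down with libvirtd: when libvirtd fails, pacemaker stops the dependent VMs, the stop needs libvirtd and fails, and the node gets fenced.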