On 2011-06-27T12:00:28, Dominik Klein <dominik.kl...@googlemail.com> wrote:
> Now it sees NOT_RUNNING on all nodes during probe and may decide to
> start the VM on a node where it cannot run. That, with the current
> version of the agent, leads to a failed start, a failed stop during
> recovery and therefore: an unnecessary stonith operation.

Yes, the 'stop' in the old agent is/was broken.

The probe, alas, can't explicitly check all prerequisites, since they
may not be online yet.

With 20/20 hindsight, it was perhaps a mistake to use a "monitor" as
the "probe". It seemed an improvement at the time, but nowadays I'm no
longer so sure; it requires the "ocf_is_probe" special case that I'm
not so fond of and leads to discussions like this. ;-)

Dejan is correct: unless the "monitor" op during probe has more
evidence than a missing file, it probably shouldn't return
"ERR_INSTALLED" (nor _CONFIGURED); that'll block the resource from the
node completely. It _is_ a valid return code of course, but
inappropriate for bits that could be on shared storage and simply
missing.

Actually, all we _must_ know for "monitor_0" is whether the resource is
active in any capacity. Any further requirements are probably best
checked at "start" time.

> I think the correct way to fix this is to still return ERR_INSTALLED
> during probe unless the cluster admin configures that the VMs config is
> on shared storage. Finding out about resource states on different nodes
> is what the probe was designed to do, was it not? And we work around
> that in this resource agent just to support certain setups.

Yeah, and that is a pretty depressing result. But I definitely dislike
the special switch for telling the cluster that the config is on shared
storage like that. That would be a scenario that no admin would test.

So defining a specific "probe" operation seems like a good idea going
forward; it can, in fact, do exactly the same thing as a "monitor" (if
it has enough definite evidence), but it would be more obvious that the
emphasis is different.
And hopefully be less confusing.


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
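
For illustration only, here is a minimal sketch of the split discussed in this message: the probe/monitor path answers only "is the resource active here?", and prerequisite checks move to "start". It is not taken from any existing agent. The function names (vm_is_active, vm_monitor, vm_start), the VM_PIDFILE and VM_CONFIG variables, and the choice of ERR_INSTALLED as the start-time error are all assumptions made for the example; a real agent would source ocf-shellfuncs and query the hypervisor instead of looking at a pid file. The numeric return codes are the standard OCF values.

```shell
#!/bin/sh
# Standard OCF return code values (normally provided by ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Hypothetical activity check: a pid file stands in for asking the
# hypervisor whether the VM is active in any capacity on this node.
vm_is_active() {
    [ -n "$VM_PIDFILE" ] && [ -f "$VM_PIDFILE" ]
}

# Used for both the probe (monitor_0) and recurring monitors.
# A missing config file is deliberately NOT treated as an error here:
# it may live on shared storage that is not mounted yet, and returning
# ERR_INSTALLED would block the resource from this node completely.
vm_monitor() {
    if vm_is_active; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# Prerequisites are checked at start time, when shared storage is
# expected to be available. (Assumption: ERR_INSTALLED is the code
# reported for an unreadable config at this point.)
vm_start() {
    if [ ! -r "$VM_CONFIG" ]; then
        return $OCF_ERR_INSTALLED
    fi
    # A real agent would launch the VM here.
    return $OCF_SUCCESS
}
```

With this shape, a node whose shared storage is not yet mounted reports NOT_RUNNING during the probe and is only rejected if a start is actually attempted there without the config.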