On 2011-06-27T12:00:28, Dominik Klein <dominik.kl...@googlemail.com> wrote:

> Now it sees NOT_RUNNING on all nodes during probe and may decide to
> start the VM on a node where it cannot run. That, with the current
> version of the agent, leads to a failed start, a failed stop during
> recovery and therefore: an unnecessary stonith operation.

Yes, the 'stop' in the old agent is/was broken.

The probe, alas, can't explicitly check all prerequisites, since they
may not be online yet. With 20/20 hindsight, it was perhaps a mistake to
use a "monitor" as the "probe". It seemed an improvement at the time,
but nowadays I'm no longer so sure; it requires the "ocf_is_probe"
special case that I'm not so fond of, and it leads to discussions like
this. ;-)

Dejan is correct: unless the "monitor" op during probe has more evidence
than a missing file, it probably shouldn't return "ERR_INSTALLED" (nor
_CONFIGURED); that'll block the resource from the node completely. It
_is_ a valid return code of course, but inappropriate for bits that
could be on shared storage and simply missing.

Actually, all we _must_ know for "monitor_0" is whether the resource is
active in any capacity. Any further requirements are probably best
checked at "start" time.
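
A minimal sketch of that split, not the actual agent: the numeric return
codes are the standard OCF values, but the function names, PID_FILE,
CONFIG_FILE and the pid-file test are hypothetical stand-ins for whatever
the real agent uses. Returning ERR_INSTALLED from "start" is likewise just
one possible choice.

```shell
#!/bin/sh
# Standard OCF return codes (numeric values as defined by the OCF RA API).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Hypothetical locations; a real agent would take these from its parameters.
: "${PID_FILE:=/tmp/vm_sketch.pid}"
: "${CONFIG_FILE:=/tmp/vm_sketch.cfg}"

# Hypothetical activity test; a real agent would ask the hypervisor instead.
vm_is_active() {
    [ -f "$PID_FILE" ]
}

# monitor (including monitor_0): report only whether the resource is active.
# A missing config file is not treated as evidence of a broken installation,
# because it may sit on shared storage that is not mounted yet.
vm_monitor() {
    if vm_is_active; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# start: this is where the prerequisites are actually needed, so check here.
vm_start() {
    if [ ! -r "$CONFIG_FILE" ]; then
        return $OCF_ERR_INSTALLED
    fi
    : > "$PID_FILE"    # stand-in for really starting the VM
    return $OCF_SUCCESS
}

# stop: must succeed when the resource is already stopped, config or not.
vm_stop() {
    rm -f "$PID_FILE"
    return $OCF_SUCCESS
}
```

With this shape, a probe on a node without the config file reports "not
running" rather than a hard error, and a failed start can still be followed
by a clean stop.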


> I think the correct way to fix this is to still return ERR_INSTALLED
> during probe unless the cluster admin configures that the VMs config is
> on shared storage. Finding out about resource states on different nodes
> is what the probe was designed to do, was it not? And we work around
> that in this resource agent just to support certain setups.

Yeah, and that is a pretty depressing result. But I definitely dislike
the special switch for telling the cluster that the config is on shared
storage like that. That would be a scenario that no admin would test.

So defining a specific "probe" operation appears to be a good idea
going forward; it can, in fact, do exactly the same thing as a
"monitor" (if it has enough definite evidence), but it would be more
obvious that the emphasis is different. And hopefully it would be less
confusing.
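
Purely as an illustration of what such a split could look like inside an
agent: the "probe" action below is hypothetical (it is not an action the
cluster sends today), as are STATE_FILE, vm_is_healthy and the other
function names. Only the numeric return codes are the standard OCF values.

```shell
#!/bin/sh
# Standard OCF return codes (numeric values as defined by the OCF RA API).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical marker that the resource is active on this node.
: "${STATE_FILE:=/tmp/vm_probe_sketch.state}"

vm_is_active() {
    [ -f "$STATE_FILE" ]
}

# Hypothetical health check; always healthy in this sketch.
vm_is_healthy() {
    return 0
}

# probe: answers only "is the resource active here in any capacity?"
vm_probe() {
    if vm_is_active; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# monitor: same activity test first, then whatever health checks apply.
vm_monitor() {
    vm_probe || return $?
    if vm_is_healthy; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}

# Dispatch on the requested action; "probe" gets its own entry point, so no
# "is this monitor really a probe?" special case is needed inside monitor.
dispatch() {
    case "$1" in
        probe)   vm_probe ;;
        monitor) vm_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```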


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
