On 2011-10-28T08:36:12, Ulrich Windl <[email protected]> wrote:

> > OCF_ERR_GENERIC should only be returned if it is up in some form and not
> > cleanly stopped nor running.
> Exactly: There are some "resources" that consist of several services 
> (processes) where it's possible that some services are up, and some services 
> are down, but not all of them. I had exactly that case.
> 
> The current monitor can be interpreted this way:
> rc=0: "everything is up" or "not everything is down"
> rc!=0: "everything is down" or "not everything is up"
> 
> So there is no rc for "something is up, something is down" (undetermined
> state).

That is incorrect. The >0 rcs cover this; OCF_ERR_GENERIC is actually a
pretty good fit.
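
To make that mapping concrete, here is a minimal sketch for a resource made
up of several processes. The numeric values are the standard OCF return
codes; the function name monitor_rc and the dict of per-process states are
hypothetical, purely for illustration, and not taken from any existing agent.

```python
# Standard OCF return codes used by monitor.
OCF_SUCCESS = 0       # everything is up
OCF_ERR_GENERIC = 1   # up in some form, but neither cleanly stopped nor running
OCF_NOT_RUNNING = 7   # cleanly stopped


def monitor_rc(process_states):
    """Map per-process states to a monitor return code.

    process_states is a hypothetical dict of process name -> bool,
    where True means that process is up.
    """
    states = list(process_states.values())
    if states and all(states):
        return OCF_SUCCESS
    if not any(states):
        return OCF_NOT_RUNNING
    # Some processes up, some down: report a failure, not "stopped".
    return OCF_ERR_GENERIC
```

With this, {"db": True, "listener": False} yields OCF_ERR_GENERIC, which is
the "something is up, something is down" case.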

> And the monitor should have an additional "undetermined" state that is 
> different from the "inability to monitor". So a change from "started" to 
> "undetermined" is most likely a "stopping" state, while a change from 
> "stopped" to "undetermined" most likely means "starting".

A resource can't be in this state. We never issue a monitor while a
start/stop/promote/demote op is running, and these are required to leave
the resource in a determined state; so a legitimate monitor will never
see a resource in a transition.

Any 'transition' is a failure somewhere and should be treated as such.

> > > Now I'm afraid if the status/monitor returns OCF_ERR_GENERIC on a probe
> > > the node is fenced: LRM will try to stop that resource, but the stop will
> > > return OCF_ERR_INSTALLED, causing a fence. Right?
> > 
> > Depends. "stop" should return success if the service is cleanly stopped,
> > regardless of whether binaries etc are present or not.
> 
> OK, but if you need the binary to determine the state, you have a
> problem (which is the case with some commercial software). You might argue
> that without the binary the software isn't installed, and thus cannot be
> running. But then you have just pushed the problem to the "start" method
> (which would fail to start the "not running" resource).

"binary not present implies software isn't running" is mostly
legitimate, though it fails if the binary was deleted - which is
admittedly a corner case. But ps/netcat may be better at probing than
requiring the binary, yes. It all depends on the resource.
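
As a rough illustration of the netcat-style approach, the following sketch
checks whether something accepts TCP connections on a port, without touching
the service's own binary. The function name port_is_listening and the
default timeout are assumptions for this example; whether a port check is a
sufficient probe depends entirely on the resource.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: this only shows that something is listening,
    not that the service behind the port is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False
```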

But yes, if the resource follows a deployment model where probe can't
determine if OCF_ERR_INSTALLED/CONFIGURED are the codes to return, it
does get pushed back to the "start" operation, which is perfectly
fine.

> > Returning OCF_ERR_GENERIC for the startup probe is a bad idea, because
> > it'll trigger the multi-node recovery logic. (Unless, of course, it is
> > indeed up.)
> Yes, I'm returning "not running" when unable to determine the state, but
> that's not 100% clean.

That is actually wrong. State unknown needs to be treated as
failed/running (i.e., ERR_GENERIC), to avoid concurrency violations.
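
A minimal sketch of that rule, assuming a hypothetical check_state callable
that returns True (up), False (cleanly stopped), or None / raises when the
state cannot be determined. Only the numeric return codes are standard OCF
values; everything else is illustrative.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def probe_rc(check_state):
    """Return a probe result that never reports "not running" on a guess.

    check_state is a hypothetical callable: True means up, False means
    cleanly stopped, None or an exception means the state is unknown.
    """
    try:
        state = check_state()
    except Exception:
        # Could not even look: treat as failed/running, not as stopped.
        return OCF_ERR_GENERIC
    if state is True:
        return OCF_SUCCESS
    if state is False:
        return OCF_NOT_RUNNING
    # Anything else (e.g. None) is an unknown state.
    return OCF_ERR_GENERIC
```

The point of the design is that OCF_NOT_RUNNING is only returned when the
agent has positively established a clean stop.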

> I know, the monitor can return anything, but the question is who will handle 
> the return code. I don't know. Maybe a table of methods and allowed return 
> codes would be helpful.

All the return codes are theoretically allowed, they just make more
sense in different situations.

We all look forward to your contributions to the documentation! ;-)


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
