On 2011-10-28T08:36:12, Ulrich Windl wrote:
> > OCF_ERR_GENERIC should only be returned if it is up in some form and not
> > cleanly stopped nor running.
> Exactly: There are some "resources" that consist of several services
> (processes) where it's possible that some services are up, and some services
> are down, but not all of them. I had exactly that case.
>
> The current monitor can be interpreted this way:
> rc=0: "everything is up" or "not everything is down"
> rc!=0: "everything is down" or "not everything is up"
>
> So there is no rc for "something is up, something is down" (indeterminate
> state).
That is incorrect. The >0 rcs cover this; OCF_ERR_GENERIC is actually a
pretty good fit.
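
As a rough sketch of that mapping for a resource made up of several
services (Python only for illustration; real agents are usually shell
scripts). The numeric values are the standard OCF exit codes. The function
name monitor_rc, its list-of-booleans input, and the handling of an empty
list are assumptions made for this example, not part of any existing agent.

```python
# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_rc(services_up):
    """Map per-service liveness flags to a single monitor exit code.

    services_up: iterable of booleans, one per service (hypothetical input;
    a real agent would obtain these by probing each process).
    """
    states = list(services_up)
    if not states:
        # Assumption: nothing could be checked, so do not claim "stopped".
        return OCF_ERR_GENERIC
    if all(states):
        return OCF_SUCCESS        # everything is up
    if not any(states):
        return OCF_NOT_RUNNING    # everything is cleanly down
    return OCF_ERR_GENERIC        # some up, some down: failed


if __name__ == "__main__":
    print(monitor_rc([True, False, True]))
```

The mixed case gets its own code here, so rc=0 and rc=7 each keep a single
meaning.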
> And the monitor should have an additional "undetermined" state that is
> different from the "inability to monitor". So a change from "started" to
> "undetermined" is most likely a "stopping" state, while a change from
> "stopped" to "undetermined" most likely means "starting".
A resource can't be in this state. We never issue a monitor while a
start/stop/promote/demote op is running, and these are required to leave
the resource in a determined state; so a legitimate monitor will never
see a resource in a transition.
Any 'transition' is a failure somewhere and should be treated as such.
> > > Now I'm afraid if the status/monitor returns OCF_ERR_GENERIC on a probe
> > > the node is fenced: LRM will try to stop that resource, but the stop will
> > > return OCF_ERR_INSTALLED, causing a fence. Right?
> >
> > Depends. "stop" should return success if the service is cleanly stopped,
> > regardless of whether binaries etc are present or not.
>
> OK, but if you need the binary to determine the state, you have a
> problem (which is the case with some commercial software). You might argue
> that without the binary the software isn't installed, and thus cannot be
> running. But then you have just pushed the problem to the "start" method
> (which would fail to start the "not running" resource).
"binary not present implies software isn't running" is mostly
legitimate, though it fails if the binary was deleted - which is
admittedly a corner case. But ps/netcat may be better at probing than
requiring the binary, yes. It all depends on the resource.
But yes, if the resource follows a deployment model where probe can't
determine if OCF_ERR_INSTALLED/CONFIGURED are the codes to return, it
does get pushed back to the "start" operation, which is perfectly
fine.
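
A minimal sketch of that split, again in Python purely for illustration.
The function names start_rc and stop_rc, the boolean inputs, and the
launch/shutdown callables are hypothetical stand-ins for whatever the real
agent does; only the exit code values are the standard OCF ones.

```python
# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_INSTALLED = 5


def stop_rc(process_running, shutdown):
    """'stop' succeeds whenever the service is cleanly stopped.

    It deliberately does not look at whether the binary exists.
    shutdown: hypothetical callable returning True once the service is down.
    """
    if not process_running:
        return OCF_SUCCESS
    return OCF_SUCCESS if shutdown() else OCF_ERR_GENERIC


def start_rc(binary_present, launch):
    """'start' is where a missing installation finally gets reported.

    launch: hypothetical callable returning True if the service came up.
    """
    if not binary_present:
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS if launch() else OCF_ERR_GENERIC


if __name__ == "__main__":
    # Not running and no binary: stop is still a success, start is not.
    print(stop_rc(False, lambda: True), start_rc(False, lambda: True))
```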
> > Returning OCF_ERR_GENERIC for the startup probe is a bad idea, because
> > it'll trigger the multi-node recovery logic. (Unless, of course, it is
> > indeed up.)
> Yes, I'm returning "not running" when unable to determine the state, but
> that's not 100% clean.
That is actually wrong. State unknown needs to be treated as
failed/running (i.e., ERR_GENERIC), to avoid concurrency violations.
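
In sketch form (Python for illustration only; the function name probe_rc
and the use of None for "could not determine the state" are assumptions of
this example):

```python
# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def probe_rc(running):
    """Translate a probe result into an exit code.

    running: True, False, or None when the state cannot be determined
    (hypothetical convention for this example).
    """
    if running is None:
        # Unknown must not be reported as "not running": the cluster could
        # then start a second copy elsewhere.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if running else OCF_NOT_RUNNING


if __name__ == "__main__":
    print(probe_rc(None))
```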
> I know, the monitor can return anything, but the question is who will handle
> the return code. I don't know. Maybe a table of methods and allowed return
> codes would be helpful.
All the return codes are theoretically allowed, they just make more
sense in different situations.
We all look forward to your contributions to the documentation! ;-)
Regards,
Lars
--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde