Re: [Linux-HA] Antw: Re: Q: OCF_ERR_INSTALLED and status/monitor

Ulrich Windl Wed, 02 Nov 2011 00:53:57 -0700

>>> Lars Marowsky-Bree <[email protected]> wrote on 28.10.2011 at 17:17 in message
<[email protected]>:
> On 2011-10-28T08:36:12, Ulrich Windl <[email protected]> 
> wrote:
> 
> > > OCF_ERR_GENERIC should only be returned if it is up in some form and not
> > > cleanly stopped nor running.
> > Exactly: There are some "resources" that consist of several services 
> (processes) where it's possible that some services are up, and some services 
> are down, but not all of them. I had exactly that case.
> > 
> > The current monitor can be interpreted this way:
> > rc=0: "everything is up" or "not everything is down"
> > rc!=0: "everything is down" or "not everything is up"
> > 
> > So there is no rc for "something is up, something is down" (undetermined 
> state).
> 
> That is incorrect. The >0 rcs cover this; OCF_ERR_GENERIC is actually a
> pretty good fit.
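
For illustration, the mapping suggested in the quoted reply can be sketched as follows. This is only a minimal sketch, not code from any existing resource agent: the function name `monitor_rc` and the dictionary of per-service states are hypothetical, while the numeric values are the standard OCF return codes.

```python
# Standard OCF return code values.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_rc(service_up):
    """Map per-service states (hypothetical dict: name -> bool) to a monitor rc."""
    states = list(service_up.values())
    if states and all(states):
        return OCF_SUCCESS        # everything is up
    if not any(states):
        return OCF_NOT_RUNNING    # everything is cleanly down
    return OCF_ERR_GENERIC        # some up, some down: report as failed


if __name__ == "__main__":
    print(monitor_rc({"db": True, "listener": True}))
    print(monitor_rc({"db": False, "listener": False}))
    print(monitor_rc({"db": True, "listener": False}))
```

The mixed case deliberately does not collapse into either "running" or "not running"; what the cluster does with that failure is the point of contention in the rest of this thread.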


Well, the CRM will then fence the node. In most cases you don't want that, 
when a (hopefully) more promising solution would be to either wait a while 
longer or just retry the last operation, i.e. start again or stop again.

> 
> > And the monitor should have an additional "undetermined" state that is 
> different from the "inability to monitor". So a change from "started" to 
> "undetermined" is most likely a "stopping" state, while a change from 
> "stopped" to "undetermined" most likely means "starting".
> 
> A resource can't be in this state. We never issue a monitor while a
> start/stop/promote/demote op is running, and these are required to leave
> the resource in a determined state; so a legitimate monitor will never
> see a resource in a transition.

Believe me: With commercial software a resource can be in that state. As it 
turned out, that state was actually triggered by the terminal width ;-) (If the 
terminal is 80 characters wide or less, one process would not be detected as 
"up". They used "ps -ef" internally to look up processes, but that command 
(without "-w") truncates output at $COLUMNS.)
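
To make the effect concrete, here is a small simulation of it. It does not call ps at all; it only assumes the behaviour described above (each output line cut at the terminal width). The sample line, the process name `vendor_worker` and both helper functions are made up for illustration.

```python
def truncate(lines, columns):
    """Mimic a ps that cuts every output line at the terminal width."""
    return [line[:columns] for line in lines]


def is_listed(lines, needle):
    """Return True if any output line contains the process name."""
    return any(needle in line for line in lines)


# Hypothetical "ps -ef" style line; the name we look for sits past column 80.
PS_OUTPUT = [
    "appuser   4711     1  0 08:00 ?        00:00:03 "
    "/opt/vendor/product/bin/wrapper --config /etc/vendor/main.conf vendor_worker",
]

if __name__ == "__main__":
    print(is_listed(PS_OUTPUT, "vendor_worker"))                # full line
    print(is_listed(truncate(PS_OUTPUT, 80), "vendor_worker"))  # 80 columns
```

With the full line the process is found; once the line is cut at 80 columns the same search misses it, although nothing about the process itself has changed.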

> 
> Any 'transition' is a failure somewhere and should be treated as such.

Yes, but the "starting" and "stopping" transitions would be quite helpful. Just 
think of crm_mon: Resources are listed as either started or stopped, while 
they may actually be in transition already.

> 
> > > > Now I'm afraid if the status/monitor returns OCF_ERR_GENERIC on a probe 
> the 
> > > node is fenced: LRM will try to stop that resource, but the stop will 
> return 
> > > OCF_ERR_INSTALLED, causing a fence. Right?
> > > 
> > > Depends. "stop" should return success if the service is cleanly stopped,
> > > regardless of whether binaries etc are present or not.
> > 
> > OK, but if you need the binary to determine the state, you are having a 
> problem (which is the case with some commercial software). You might argue 
> that without the binary the software isn't installed, and thus cannot be 
> running. But then you just pushed the problem to the "start" method (which 
> would fail to start the "not running" resource).
> 
> "binary not present implies software isn't running" is mostly
> legitimate, though it fails if the binary was deleted - which is
> admittedly a corner case. But ps/netcat may be better at probing than
> requiring the binary, yes. It all depends on the resource.

The problem is also one of supportability: If some vendor supplies a tool 
to start, stop and monitor their stuff, you are on rather safe ground when 
complaining about problems as long as you use that tool to manage the stuff. 
If you start writing your own tools, you always have to argue about who made 
the mistake.

> 
> But yes, if the resource follows a deployment model where probe can't
> determine if OCF_ERR_INSTALLED/CONFIGURED are the codes to return, it
> does get pushed back to the "start" operation, which is perfectly
> fine.
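
The division of labour described in the quoted text could look roughly like this. Again, this is only a sketch under assumptions: the three functions and their boolean parameters are hypothetical, and the return code for "still running, but the vendor tool is gone" is my own choice, not something settled in this thread.

```python
# Standard OCF return code values.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7


def probe(process_found):
    """Startup probe: report the state without requiring the binary."""
    return OCF_SUCCESS if process_found else OCF_NOT_RUNNING


def stop(binary_present, process_found):
    """Stop succeeds if nothing is running, installed or not."""
    if not process_found:
        return OCF_SUCCESS
    if not binary_present:
        # Assumption: running, but no tool available to stop it.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS  # assume the vendor tool stopped it cleanly


def start(binary_present):
    """A missing installation only surfaces when a start is attempted."""
    return OCF_SUCCESS if binary_present else OCF_ERR_INSTALLED
```

On a node without the software, `probe(False)` and `stop(False, False)` both return cleanly, so nothing there gives the cluster a reason to fence; only `start(False)` reports OCF_ERR_INSTALLED.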

I'm also quite confused that probes are done on nodes where the resource has a 
-INFINITY location score. If that weren't done, I wouldn't have had all 
that trouble with fencing. I had only tested the RA on the node where the 
software is installed and expected to run.

> 
> > > Returning OCF_ERR_GENERIC for the startup probe is a bad idea, because
> > > it'll trigger the multi-node recovery logic. (Unless, of course, it is
> > > indeed up.)
> > Yes, I'm returning "not running" when unable to determine the state, but 
> that's not 100% clean.
> 
> That is actually wrong. State unknown needs to be treated as
> failed/running (i.e., ERR_GENERIC), to avoid concurrency violations.

Yes, (as said earlier) that causes fencing quite early. Fencing only makes 
sense if the operation actually does fix the problem. In the case of software 
errors it doesn't ;-) So a resource freeze may be the better solution (once we 
have such transitions).

> 
> > I know, the monitor can return anything, but the question is who will 
> handle the return code. I don't know. Maybe a table of methods and allowed 
> return codes would be helpful.
> 
> All the return codes are theoretically allowed, they just make more
> sense in different situations.
> 
> We all look forward to your contributions to the documentation! ;-)

Lars,

how would I be able to provide such a table if that information is (at best) 
buried in the sources with "the freedom to change anytime"? I think it's 
better to write software following a specification rather than the other way 
'round.

Ulrich


> 
> 
> Regards,
>     Lars



 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
