Re: [Linux-ha-dev] Re: [Bug 957] It is a mistake for the CRM to rely on the exact return code $OCF_NOT_RUNNING versus other forms of failure

Joachim Banzhaf Fri, 18 Nov 2005 19:22:01 -0800

Hi devs,

On Tuesday, 15 November 2005 10:54, Lars Marowsky-Bree wrote:
> On 2005-11-14T22:20:17, [EMAIL PROTECTED] wrote:
>
> I'm pulling this discussion over to the list, because I personally find
> bugzilla really unsuited for a design discussion ;-)


agreed :-)

First, my point of view concerning RA return codes. Maybe I just have to catch 
up with you first; if so, feel free to ignore my inline comments:


monitor
=======

there are two cases

1) continuous monitoring

healthy (rc == 0) -> OK
other (rc != 0) -> Error (could be stopped or undefined or unhealthy)
potential optimization: if it is stopped cleanly (rc == 7), the usual stop 
action following that can be avoided.

2) initial probing

stopped (rc == 7) -> skip the initial stop (this is an optional optimization)
healthy (rc == 0) -> skip the initial start (hb could adopt an already running 
service without service interruption. Can the PE deal with that right now?)
other (rc != 0 && rc != 7) -> need an initial stop, just in case...

problem: script could rely on other resources being started (e.g. drbd or 
filesystem) which may not be the case at that time. 
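
To make the two cases concrete, here is a minimal Python sketch of how a 
cluster manager might react to a monitor result. It is only an illustration 
of the table above, not how the CRM is actually implemented; the Action names 
and the after_monitor function are hypothetical. Only the values 0 and 7 come 
from the discussion.

```python
from enum import Enum

OCF_SUCCESS = 0       # healthy
OCF_NOT_RUNNING = 7   # cleanly stopped


class Action(Enum):
    """Hypothetical follow-up actions, named for this sketch only."""
    NOTHING = "healthy, nothing to do"
    RECOVER = "stop, then start again"
    RECOVER_SKIP_STOP = "start again, the stop can be skipped"
    ADOPT = "already running, skip the initial start"
    SKIP_STOP = "cleanly stopped, skip the initial stop"
    STOP = "state unknown, stop just in case"


def after_monitor(rc: int, probing: bool) -> Action:
    """Map a monitor return code to a follow-up action."""
    if probing:
        if rc == OCF_SUCCESS:
            return Action.ADOPT
        if rc == OCF_NOT_RUNNING:
            return Action.SKIP_STOP
        # Anything else is "grey": treat it as not stopped.
        return Action.STOP
    if rc == OCF_SUCCESS:
        return Action.NOTHING
    if rc == OCF_NOT_RUNNING:
        # Optimization only: correctness does not depend on this branch.
        return Action.RECOVER_SKIP_STOP
    return Action.RECOVER
```

Note that removing both rc == 7 branches only costs extra stop operations; 
the remaining branches still lead to a safe state.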


start
=====

healthy (rc == 0) -> OK (even if it was running before)
other (rc != 0) -> Error (could be stopped, unhealthy or undefined)
potential optimization: if it is stopped cleanly (rc == 7), the usual stop 
action following that can be avoided.


stop
====

stopped (rc == 0) -> OK, even if it was stopped before
other (rc != 0) -> Error, resource fencing failed, need node fencing -> have 
to reboot node

problem: script could rely on other resources being started (e.g. drbd or 
filesystem) which may not be the case at that time. 
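
The start and stop rules above can be sketched the same way. This is a 
hedged illustration only; the step names ("report-error", "stop", 
"fence-node") and both function names are made up for the example and do not 
correspond to anything in heartbeat.

```python
from typing import List


def steps_after_start(rc: int) -> List[str]:
    """Follow-up steps after a start operation returned rc."""
    if rc == 0:
        return []                      # OK, even if it was running before
    if rc == 7:
        return ["report-error"]        # cleanly stopped: the stop can be skipped
    return ["report-error", "stop"]    # stopped, unhealthy or undefined


def steps_after_stop(rc: int) -> List[str]:
    """Follow-up steps after a stop operation returned rc."""
    if rc == 0:
        return []                      # OK, even if it was stopped before
    # Resource fencing failed, so the node itself has to be fenced.
    return ["report-error", "fence-node"]
```

The asymmetry is the point: a failed start costs at most one extra stop, 
while a failed stop costs a node reboot.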


From the above, I don't see why I would need to rely on any return code 
distinction other than zero versus nonzero, except for optimizations which 
"only" save me a stop operation. So I think the subject is right. The CRM 
should not _rely_ on return code 7 but should take advantage of it.
For init/LSB and heartbeat 1.x scripts, where we have only status and no 
monitor, the status return code 3 should be mapped to 7 and running should be 
interpreted as healthy. I think that's the best one can do with them.
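
A minimal sketch of that mapping in Python, assuming the LSB status codes 
(0 = running, 3 = not running, everything else dead or unknown) and the OCF 
values 0, 1 and 7. The function name lsb_status_to_monitor is hypothetical.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

LSB_STATUS_RUNNING = 0
LSB_STATUS_NOT_RUNNING = 3


def lsb_status_to_monitor(status_rc: int) -> int:
    """Translate an LSB 'status' return code into a monitor return code."""
    if status_rc == LSB_STATUS_RUNNING:
        # status cannot tell running from healthy, so assume healthy.
        return OCF_SUCCESS
    if status_rc == LSB_STATUS_NOT_RUNNING:
        return OCF_NOT_RUNNING
    # Dead with pid/lock file, unknown, or anything else: plain failure.
    return OCF_ERR_GENERIC
```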
 
I have far more problems with a failed stop, which admittedly may be triggered 
by a monitor not returning 7:
Most stop operations out in the field return rc != 0 even if the resource is 
certainly stopped (e.g. binary or config not found). This can easily happen 
if the service lives on a drbd device or filesystem which is not yet available 
during initial probing. In these cases the return codes would even be LSB 
compliant. Rebooting, however, is unacceptable in these cases. So at least 
while probing and for legacy scripts it would make sense to allow the 
following LSB return codes to mean success:
5       program is not installed
6       program is not configured
and maybe (although that is not an LSB compliant returncode for stop):
7       program is not running
I have not checked whether this is done right now.
Maybe a resource attribute in the CIB could help, one that tells the CRM that 
stop and monitor are unreliable for this resource while probing?
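
As a sketch of what I mean, assuming a single boolean that is true while 
probing or for legacy scripts (how the CRM would obtain it, e.g. from such a 
resource attribute, is left open). The names stop_succeeded and lenient are 
made up for this example.

```python
LSB_NOT_INSTALLED = 5
LSB_NOT_CONFIGURED = 6
NOT_RUNNING = 7   # not an LSB compliant return code for stop, hence "maybe"

# Return codes that would count as a successful stop in lenient mode.
LENIENT_STOP_OK = {0, LSB_NOT_INSTALLED, LSB_NOT_CONFIGURED, NOT_RUNNING}


def stop_succeeded(rc: int, lenient: bool) -> bool:
    """Decide whether a stop result means the resource is stopped.

    lenient is assumed to be True while probing or for legacy scripts.
    """
    if lenient:
        return rc in LENIENT_STOP_OK
    return rc == 0
```

With lenient set, a stop that fails because the binary lives on a 
filesystem that is not mounted yet (rc == 5) no longer leads to a reboot.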


> Alan, let me preface this with: Yes, I agree it would be great if we
> could do away with it. If we find a way of doing so, I'm happy. Right
> now, I think we can't. So, I'll try to explain why, but will actually be
> quite happy if you see a better solution. Ok?
>
> This distinction is required for probing; I don't see how probing can
> work without it.

It just works less efficiently. So the CRM should not rely on it but make use 
of it.

> > It is unlikely that most resource agents will bother to distinguish
> > the cause of a resource not really running right - and they'll likely
> > return a single return code for all non-success cases.
>
> Yes, I can see that, because it seems to be the most easy way.
>
> During normal operation, we could indeed treat this the same. We're
> monitoring it on node A, it goes away, so we throw in a "stop" before we
> start it again (wherever). That's ok. (A)

agreed.

> But, during probing, we need to know where it is _started_, either
> broken or healthy.

If we are not sure it is stopped, stop it. No big deal. Right?

> _If_ it is indeed started on more than one node, it's rather likely that
> it will not be healthy on all of them. (Because, heck, it's probably
> operating on corrupted data ;-) So, we are unlikely to get a SUCCESS
> exit code.
>
> So, if we treated all error codes the same for that case, we'd have no way
> to distinguish 'kaputt' from 'stopped / not running'.
>
> We'd always have to treat "kaputt" as "not running" - and fail to detect
> resources which are active where they shouldn't be but broken. (B)

I think this is not acceptable. Unhealthy resources should be stopped.

> (Or treat "not running" as "kaputt", assuming that resources are running
> but broken everywhere they are stopped. This is quite obviously even
> less acceptable ;-)

Isn't this just optimization? Stop on stopped resources should be ok (it is, 
by definition).

> That would, IMHO, make probing rather less useful; because it detects a
> _severe_ potentially corruptive error case which either an admin
> mistake, a split-brain or a software error might have caused.

Sad, but true.
But to remedy that, I think probing has to be seen as something different from 
monitoring. Monitoring takes place when all resource constraints are met and 
the resource is expected to run 100% healthy. Probing takes place in 
potentially bogus, even nonexistent environments and aims at detecting 
resources as 100% stopped. We have black and white and shades of grey. In one 
case grey is counted as black, in the other as white: while monitoring, 
anything short of healthy is a failure; while probing, anything short of 
cleanly stopped is treated as running.

> I think this is quite useful. (And, I don't think we've said anything
> yet which invalidates the discussions we've had about this one at the
> OCF meetings.)
>
> But, I'm quite open to making it useful _and_ easier to get right. ;-)

The monitor operation with specific return codes is enough in an ideal world.
To make it more obvious for those RA programmers who don't like reading 
specs, two distinct operations (e.g. is-healthy and need-fencing) which can 
only return true or false would be more self-describing. But I don't like 
that. More to my taste: good documentation, including the reasoning behind the 
return codes and the consequences of getting them wrong, combined with 
treating unhealthy (kaputt) as not stopped while probing, so stop it and 
complain loudly about it.

> We could make the (B) option the default for anything but OCF RAs, and
> optional for those if the scripts are wrong?
>
> Some further comments, which I think might help:
> > As an example, maybe it died uncleanly, and left some semaphores
> > hanging around that should have been cleaned up, or one of 100 other
> > possible symptoms.
>
> Actually, if they have semaphores lying around, that by itself isn't a
> problem. It is untidy, but not a correctness problem, I should clarify.
>
> This could be fixed by (A); just stop it if we expected it to be
> running, to be sure it was cleaned up.

Isn't that done already? I think so and I would expect that. Having monitor 
fail does not mean it is stopped.

> > Most will either try and talk to the service and get an error
> > response, or they'll look to see if it's running before asking.  If
> > they do the former, they have no idea if it's running but not working,
> > or if it was never started or cleanly stopped.
> >
> > Your comments may make sense, but the evidence suggests it doesn't
> > work.  It's easy to say that everyone should do it right, but it's
> > much harder to make everyone do it right.
>
> I'm not sure about this one. The scripts will also wreak havoc if
> start/stop are not idempotent, if stop leaves behind artifacts, and for
> a number of other reasons.
>
> This could be helped by (C1) documenting this requirement more clearly,
> and slightly relaxing of the rules for this case, though:
> "OCF_NOT_RUNNING may be returned for resources which are in a state in
> which they do not affect the resource on other nodes." (C2)
>
> While it would still be preferable if they only returned it if indeed
> it was cleanly stopped, it would, again, only be "untidy" but not a
> correctness issue.

In general, I think it should work with legacy scripts - just in a less 
efficient way. For OCF RAs, which are specifically designed for this case, I 
guess it is acceptable for them to break if they don't conform to the 
standard, when conformance is necessary for important features to work at all.

> > If you want to know if something is running, that's exactly what the
> > status operation is supposed to do.  If it's the special case of on
> > startup, then that would seem like a very good way to make this
> > dependence on exact failure return codes go away, while remaining
> > perfectly LSB-compliant.
>
> About this one I'm not at all sure.

Me neither. I don't think status can do better than monitor in that case.
It is usually not designed to make a distinction between surely stopped and 
not healthy. And besides, it is not a required op for OCF RAs.

> If the script can indeed tell the difference in its status operation, it
> should be able to tell us in monitor.
>
> If they can't get the subset right, I'm personally not trusting them to
> get the superset done any better ;-)
>
> Even status has the "OK" / "dead|unknown" / "stopped" distinction.
>
> Looking at the Linux FailSafe history (or what I can still remember),
> yes, this was a problem, people got the exclusive action wrong some of
> the time. However, because it's an error where the cluster will
> immediately bitch on startup, it usually got fixed very quickly ;-)
>
> (Small tangent) Looking at the verifyallidle implementation in the
> ResourceManager, this wasn't a problem for hb1, because all it cared for
> was "running, in whatever state" and "not running". Now we've introduced
> the concept of "service health", I think we need the distinction.
>
> Am I making sense?
>
> I think implementing (A) is indeed strongly recommended.

I think this is done already.

> (B) might be helpful. (C1) is certainly a good idea. (C2), maybe.

B: I don't agree.
C: I am with you here.

So, my conclusion is that having a distinct return code 7 is good. It helps 
avoid unnecessary stop operations but is not - and should not be - 
strictly required.
_Much_ more important is handling stop failures in a more differentiated way 
than just OK/not OK.

>
> Sincerely,
>     Lars Marowsky-Brée <[EMAIL PROTECTED]>

regards,

Joachim Banzhaf
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
