[Linux-ha-dev] Re: [Bug 957] It is a mistake for the CRM to rely on the exact return code $OCF_NOT_RUNNING versus other forms of failure

Lars Marowsky-Bree Tue, 15 Nov 2005 01:55:19 -0800

On 2005-11-14T22:20:17, [EMAIL PROTECTED] wrote:

I'm pulling this discussion over to the list, because I personally find
bugzilla really unsuited for a design discussion ;-)


Alan, let me preface this with: Yes, I agree it would be great if we
could do away with it. If we find a way of doing so, I'm happy. Right
now, I think we can't. So, I'll try to explain why, but will actually be
quite happy if you see a better solution. Ok?

This distinction is required for probing; I don't see how probing can
work without it.

> It is unlikely that most resource agents will bother to distinguish
> the cause of a resource not really running right - and they'll likely
> return a single return code for all non-success cases.

Yes, I can see that, because it seems to be the easiest way.

During normal operation, we could indeed treat this the same. We're
monitoring it on node A, it goes away, so we throw in a "stop" before we
start it again (wherever). That's ok. (A)
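
A minimal sketch of (A) in Python, assuming the usual OCF exit code values (0 for success, 7 for not running). The function name and the node arguments are made up for illustration and are not CRM code:

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def recovery_actions(monitor_rc, current_node, target_node):
    """Plan recovery for a resource we expected to be running on current_node.

    Every non-success monitor result is handled identically: stop first,
    so that any leftovers are cleaned up, then start again (wherever).
    """
    if monitor_rc == OCF_SUCCESS:
        return []
    return [("stop", current_node), ("start", target_node)]
```

With this policy the exact failure code does not matter during normal operation: recovery_actions(7, "a", "b") and recovery_actions(1, "a", "b") yield the same plan.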

But, during probing, we need to know where it is _started_, either
broken or healthy.

_If_ it is indeed started on more than one node, it's rather likely that
it will not be healthy on all of them. (Because, heck, it's probably
operating on corrupted data ;-) So, we are unlikely to get a SUCCESS
exit code.

So, if we treated all error codes the same for that case, we'd have no
way to distinguish 'kaputt' from 'stopped / not running'.

We'd always have to treat "kaputt" as "not running" - and fail to detect
resources which are active where they shouldn't be but broken. (B)

(Or treat "not running" as "kaputt", assuming that resources are running
but broken everywhere they are stopped. This is quite obviously even
less acceptable ;-)
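
To illustrate the difference during probing, here is a rough Python sketch. It assumes the usual OCF exit code values; the function names and the shape of the probe results (a dict of node name to exit code) are hypothetical, not the CRM's actual data structures:

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def classify_probe(rc):
    """Map a probe exit code to one of three states."""
    if rc == OCF_SUCCESS:
        return "running"
    if rc == OCF_NOT_RUNNING:
        return "stopped"
    return "failed"  # started here, but broken ('kaputt')


def active_nodes(probe_results):
    """Nodes where the resource is started, whether healthy or broken."""
    return sorted(node for node, rc in probe_results.items()
                  if classify_probe(rc) != "stopped")


def active_nodes_lumped(probe_results):
    """Option (B): treat every non-success code as 'not running'."""
    return sorted(node for node, rc in probe_results.items()
                  if rc == OCF_SUCCESS)


probes = {"node-a": OCF_SUCCESS,
          "node-b": OCF_ERR_GENERIC,
          "node-c": OCF_NOT_RUNNING}
```

For these example results, active_nodes(probes) reports both node-a and node-b, so the resource is seen as active on more than one node. active_nodes_lumped(probes) reports only node-a and misses the broken instance on node-b.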

That would, IMHO, make probing rather less useful, because what it
detects is a _severe_, potentially corrupting error case which an admin
mistake, a split-brain or a software error might have caused.

I think this is quite useful. (And, I don't think we've said anything
yet which invalidates the discussions we've had about this one at the
OCF meetings.)

But, I'm quite open to making it useful _and_ easier to get right. ;-)

We could make the (B) option the default for anything but OCF RAs, and
optional for those if the scripts are wrong?

Some further comments, which I think might help:

> As an example, maybe it died uncleanly, and left some semaphores
> hanging around that should have been cleaned up, or one of 100 other
> possible symptoms.

Actually, if they have semaphores lying around, that by itself isn't a
problem. It is untidy, but not a correctness problem, I should clarify.

This could be fixed by (A); just stop it if we expected it to be
running, to be sure it was cleaned up.

> Most will either try and talk to the service and get an error
> response, or they'll look to see if it's running before asking.  If
> they do the former, they have no idea if it's running but not working,
> or if it was never started or cleanly stopped.
> 
> Your comments may make sense, but the evidence suggests it doesn't
> work.  It's easy to say that everyone should do it right, but it's
> much harder to make everyone do it right.

I'm not sure about this one. The scripts will also wreak havoc if
start/stop are not idempotent, if stop leaves behind artifacts, and for
a number of other reasons.

This could be helped by (C1) documenting this requirement more clearly,
and by slightly relaxing the rules for this case, though:
"OCF_NOT_RUNNING may be returned for resources which are in a state in
which they do not affect the resource on other nodes." (C2)

While it would still be preferable if they only returned it if indeed
it was cleanly stopped, it would, again, only be "untidy" but not a
correctness issue.

> If you want to know if something is running, that's exactly what the
> status operation is supposed to do.  If it's the special case of on
> startup, then that would seem like a very good way to make this
> dependence on exact failure return codes go away, while remaining
> perfectly LSB-compliant.

About this one I'm not at all sure.

If the script can indeed tell the difference in its status operation, it
should be able to tell us in monitor.

If they can't get the subset right, I'm personally not trusting them to
get the superset done any better ;-)

Even status has the "OK" / "dead|unknown" / "stopped" distinction.
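
As a sketch of how that three-way distinction could carry over to monitor, here is one possible mapping in Python. The LSB status codes (0 running, 1 and 2 dead with a stale pid or lock file, 3 not running, 4 unknown) and the OCF values are the standard ones; the mapping itself and the function name are an assumption for illustration, not something the CRM or any existing RA is known to implement:

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

LSB_STATUS_RUNNING = 0
LSB_STATUS_NOT_RUNNING = 3


def lsb_status_to_ocf(status_rc):
    """Translate an LSB 'status' exit code into an OCF monitor result."""
    if status_rc == LSB_STATUS_RUNNING:
        return OCF_SUCCESS
    if status_rc == LSB_STATUS_NOT_RUNNING:
        return OCF_NOT_RUNNING
    # Dead with a stale pid or lock file (1, 2), unknown (4), or anything
    # else: do not claim the resource is cleanly stopped.
    return OCF_ERR_GENERIC
```

The only status result reported as "not running" is the one that explicitly says so; everything ambiguous stays an error.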

Looking at the Linux FailSafe history (or what I can still remember),
yes, this was a problem, people got the exclusive action wrong some of
the time. However, because it's an error where the cluster will
immediately bitch on startup, it usually got fixed very quickly ;-)

(Small tangent) Looking at the verifyallidle implementation in the
ResourceManager, this wasn't a problem for hb1, because all it cared
about was "running, in whatever state" and "not running". Now that we've
introduced the concept of "service health", I think we need the
distinction.

Am I making sense?

I think implementing (A) is indeed strongly recommended. 

(B) might be helpful. (C1) is certainly a good idea. (C2), maybe.


Sincerely,
    Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
