On 11/15/05, Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-11-14T22:20:17, [EMAIL PROTECTED] wrote:
>
> I'm pulling this discussion over to the list, because I personally find
> bugzilla really unsuited for a design discussion ;-)
>
> Alan, let me preface this with: Yes, I agree it would be great if we
> could do away with it. If we find a way of doing so, I'm happy. Right
> now, I think we can't. So, I'll try to explain why, but will actually be
> quite happy if you see a better solution. Ok?
>
> This distinction is required for probing; I don't see how probing can
> work without it.
>
> > It is unlikely that most resource agents will bother to distinguish
> > the cause of a resource not really running right - and they'll likely
> > return a single return code for all non-success cases.
>
> Yes, I can see that, because it seems to be the easiest way.
>
> During normal operation, we could indeed treat this the same. We're
> monitoring it on node A, it goes away, so we throw in a "stop" before we
> start it again (wherever). That's ok. (A)
>
> But, during probing, we need to know where it is _started_, either
> broken or healthy.
>
> _If_ it is indeed started on more than one node, it's rather likely that
> it will not be healthy on all of them. (Because, heck, it's probably
> operating on corrupted data ;-) So, we are unlikely to get a SUCCESS
> exit code.
>
> So, if we treated all error codes the same in that case, we'd have no
> way to distinguish 'kaputt' from 'stopped / not running'.
>
> We'd always have to treat "kaputt" as "not running" - and fail to detect
> resources which are active where they shouldn't be but broken. (B)
>
> (Or treat "not running" as "kaputt", assuming that resources are running
> but broken everywhere they are stopped. This is quite obviously even
> less acceptable ;-)
>
> That would, IMHO, make probing rather less useful; because it detects a
> _severe_, potentially corruptive error case which an admin mistake, a
> split-brain or a software error might have caused.
>
> I think this is quite useful. (And, I don't think we've said anything
> yet which invalidates the discussions we've had about this one at the
> OCF meetings.)
>
> But, I'm quite open to making it useful _and_ easier to get right. ;-)
>
> We could make the (B) option the default for anything but OCF RAs, and
> optional for OCF RAs whose scripts get this wrong?
>
> Some further comments, which I think might help:
>
> > As an example, maybe it died uncleanly, and left some semaphores
> > hanging around that should have been cleaned up, or one of 100 other
> > possible symptoms.
>
> Actually, if they have semaphores lying around, that by itself isn't a
> problem. It is untidy, but not a correctness problem, I should clarify.
>
> This could be fixed by (A); just stop it if we expected it to be
> running, to be sure it was cleaned up.
>
> > Most will either try and talk to the service and get an error
> > response, or they'll look to see if it's running before asking.  If
> > they do the former, they have no idea if it's running but not working,
> > or if it was never started or cleanly stopped.
> >
> > Your comments may make sense, but the evidence suggests it doesn't
> > work.  It's easy to say that everyone should do it right, but it's
> > much harder to make everyone do it right.
>
> I'm not sure about this one. The scripts will also wreak havoc if
> start/stop are not idempotent, if stop leaves behind artifacts, and for
> a number of other reasons.
>
> This could be helped by (C1) documenting this requirement more clearly,
> and by slightly relaxing the rules for this case:
> "OCF_NOT_RUNNING may be returned for resources which are in a state in
> which they do not affect the resource on other nodes." (C2)
>
> While it would still be preferable if they only returned it when indeed
> cleanly stopped, it would, again, only be "untidy" but not a
> correctness issue.
>
> > If you want to know if something is running, that's exactly what the
> > status operation is supposed to do.  If it's the special case of on
> > startup, then that would seem like a very good way to make this
> > dependence on exact failure return codes go away, while remaining
> > perfectly LSB-compliant.
>
> About this one I'm not at all sure.
>
> If the script can indeed tell the difference in its status operation, it
> should be able to tell us in monitor.
>
> If they can't get the subset right, I'm personally not trusting them to
> get the superset done any better ;-)
>
> Even status has the "OK" / "dead|unknown" / "stopped" distinction.
>
> Looking at the Linux FailSafe history (or what I can still remember),
> yes, this was a problem, people got the exclusive action wrong some of
> the time. However, because it's an error where the cluster will
> immediately bitch on startup, it usually got fixed very quickly ;-)
>
> (Small tangent) Looking at the verifyallidle implementation in the
> ResourceManager, this wasn't a problem for hb1, because all it cared
> about was "running, in whatever state" and "not running". Now that
> we've introduced the concept of "service health", I think we need the
> distinction.
>
> Am I making sense?
>
> I think implementing (A) is indeed strongly recommended.
>
> (B) might be helpful. (C1) is certainly a good idea. (C2), maybe.
>

This whole conversation makes no sense to me.  If people don't follow
the spec things will not work as they expect... where is the mystery?

The spec very clearly states:

3.6.1. All operations

0       No error, action succeeded completely
1       generic or unspecified error (current practice)
        The "monitor" operation shall return this for a crashed, hung or
        otherwise non-functional resource.

The only sane interpretation is therefore to treat the resource as failed.
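
For illustration, here is a minimal sketch of what the spec's distinction
looks like inside a resource agent's monitor action. This is hedged: the
pidfile convention and the exact checks are illustrative assumptions, not
taken from any existing RA; only the exit-code values follow the OCF RA
conventions (0 = success, 1 = generic error, 7 = not running).

```shell
#!/bin/sh
# Sketch of an OCF-style monitor action that preserves the three-way
# distinction discussed above: cleanly stopped vs. started-and-healthy
# vs. started-but-broken. The pidfile convention is an assumption.

OCF_SUCCESS=0        # running and healthy
OCF_ERR_GENERIC=1    # started here, but crashed/hung/non-functional
OCF_NOT_RUNNING=7    # cleanly stopped (OCF RA convention)

monitor() {
    pidfile=$1
    # No pidfile: the resource was never started, or was cleanly
    # stopped, on this node.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Stale pidfile: the resource was started here but the process is
    # gone. Report a generic error, NOT "not running", so that probing
    # can tell 'kaputt' from 'stopped'.
    kill -0 "$pid" 2>/dev/null || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}
```

An agent that collapses the last two cases into one code is exactly the
kind that breaks probing as described above.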

If people aren't following the spec, then that's an issue we need to
address with education (or changes to the spec), not by requiring the
CRM to have a crystal ball.


Also, for the record, option A is not allowed as it would
bypass/conflict with the "multiple active" recovery policies
configured in the CIB and carried out in the PE.
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
