Re: [Linux-HA] DRBD and pacemaker interaction

Lars Ellenberg Tue, 05 Apr 2011 01:30:25 -0700

On Tue, Apr 05, 2011 at 09:17:14AM +0200, Andrew Beekhof wrote:
> On Mon, Apr 4, 2011 at 10:14 PM, Lars Ellenberg
> <[email protected]> wrote:
> > On Mon, Apr 04, 2011 at 09:43:27AM +0200, Andrew Beekhof wrote:
> >> >>>>>>>> I am missing the state: running degraded or suboptimal.
> >> >>>>>>>
> >> >>>>>>> Yep, "degraded" is not a state available for pacemaker.
> >> >>>>>>> Pacemaker cannot do much about "suboptimal".
> >
> >
> >> Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE).
> >
> > And, of course, OCF_MASTER_BUT_ONLY_ONE_FAILURE_AWAY_FROM_COMPLETE_DATA_LOSS
> 
> Feeling quite alright there?


Yes. Sorry. I just wanted to point out that you
would need extra exit codes for both RUNNING and MASTER.
And a degraded Master, thinking of replication,
is just one failure away from data non-availability.

> The intention was that the PE would treat it the same as OCF_RUNNING -
> hence the name.
> It would exist purely to give admin tools the ability to provide
> additional feedback to users - like you outlined above.
> 
> Essentially it would be a way for the RA to say "Something isn't
> right, but you (ie. pacemaker) shouldn't do anything about it other
> than let a human know".
> Anything more complex is WAY out of scope.

Exactly.

And I'd rather not go the exit code way, but some "degraded"
attribute/score way, similar to the "master" attribute/score thing.

Because that mechanism is readily available, and would need at most
changes to crm_mon. There would be no need to change the PE, or to
double check resource agents for random exit codes that previously were
treated as "unknown exit code -> generic error", and would now happen
to be defined and interpreted as non-error.

Though arguably, any RA that currently returns an exit code outside of
the defined exit code range is broken, thus we are allowed to not care?

Another problem with the exit codes is backwards compatibility:
you can no longer use the same RA both on a Pacemaker that knows about
the additional exit codes and on Pacemaker versions older than that.
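
To illustrate the backwards compatibility concern, here is a minimal
sketch in Python. It models only a subset of the OCF exit codes, and the
numeric value of OCF_RUNNING_BUT_DEGRADED is an assumption picked purely
for illustration; interpret() is a hypothetical helper, not Pacemaker code.

```python
# A subset of the OCF exit codes defined for resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8

# Hypothetical value for the proposed code; chosen only for this sketch.
OCF_RUNNING_BUT_DEGRADED = 190

# Codes understood by an "old" and a "new" cluster manager in this model.
OLD_KNOWN = {OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING, OCF_RUNNING_MASTER}
NEW_KNOWN = OLD_KNOWN | {OCF_RUNNING_BUT_DEGRADED}


def interpret(rc, known):
    """Map a monitor exit code to what the cluster manager would conclude."""
    if rc not in known:
        # Unknown exit codes fall back to "generic error".
        return "error"
    if rc in (OCF_SUCCESS, OCF_RUNNING_BUT_DEGRADED):
        # Degraded is treated the same as running; only feedback differs.
        return "running"
    if rc == OCF_RUNNING_MASTER:
        return "master"
    if rc == OCF_NOT_RUNNING:
        return "stopped"
    return "error"


rc = OCF_RUNNING_BUT_DEGRADED
print(interpret(rc, NEW_KNOWN))  # running
print(interpret(rc, OLD_KNOWN))  # error
```

The same RA return value is harmless on the newer model but looks like a
failure on the older one, which is why one RA could not serve both.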

A crm_degraded command that would set the "degraded" attribute,
similar to what crm_master does for the master attribute
would avoid that.

crm_mon (or any command that gives "feedback to the users")
can trigger on "present and not empty", and just display the content.

So it even gives the RA a means to give details about the kind of
degradedness. DRBD could say crm_degraded "replication link down",
"lower level disk failed", or whatever.
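
The display side of this idea could look roughly like the following
Python sketch. It is only a model: status_line() and the plain dict of
attributes are hypothetical stand-ins, not how crm_mon is implemented,
and the attribute name "degraded" is the one proposed above.

```python
def status_line(resource, role, attributes):
    """Render one status line; append the degraded text if it is set."""
    line = f"{resource}: {role}"
    reason = attributes.get("degraded", "")
    # Trigger on "present and not empty", then just display the content.
    if reason:
        line += f" (degraded: {reason})"
    return line


# The role itself is untouched; the attribute only adds feedback.
print(status_line("drbd0", "Master", {"degraded": "replication link down"}))
print(status_line("drbd0", "Master", {"degraded": ""}))
print(status_line("drbd0", "Master", {}))
```

Because the attribute carries free text, the tool needs no knowledge of
what kinds of degradation a particular RA may report.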

Question is: should that degraded attribute (for clones, or ms)
follow/be attached to the instance id,
or just the resource id (without the trailing :<instance number>)?

I'd suggest it should not include the :<instance number>.

At least for a degraded DRBD that would not make much sense.
It might make sense in the general case,
if individual clone instances can be degraded independently.
Though I cannot think of a real example of such a case.
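
Dropping the instance number from the id could be sketched like this in
Python. The "degraded-" prefix and the function name are assumptions made
for illustration only; nothing here is an existing Pacemaker interface.

```python
def degraded_attr_name(instance_id):
    """Build an attribute name from a resource id, ignoring a trailing
    :<instance number> if there is one."""
    base, sep, suffix = instance_id.rpartition(":")
    if sep and suffix.isdigit():
        # Clone or ms instance such as "drbd0:1": keep only "drbd0".
        return f"degraded-{base}"
    # Plain resource id, or a colon that is not followed by a number.
    return f"degraded-{instance_id}"


print(degraded_attr_name("drbd0:0"))  # degraded-drbd0
print(degraded_attr_name("drbd0:1"))  # degraded-drbd0
print(degraded_attr_name("drbd0"))    # degraded-drbd0
```

With this naming, all instances of a clone or ms resource share one
attribute name, which matches the suggestion above.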

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
