On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg
<[email protected]> wrote:
> On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote:
>> Hi,
>>
>> we experiment with DRBD and pacemaker and see several times that the
>> DRBD part is degraded (One node is outdated or diskless or something
>> similar) but crm_mon just reports that the DRBD resource runs as master
>> and slave on the nodes.
>>
>> There is no indication that the resource is not in its optimal mode of
>> operation.
>>
>> For me it seems as if pacemaker knows only the states: running, stopped,
>> failed.
>>
>> I am missing the state: running degraded or suboptimal.
>
> Yep, "degraded" is not a state available for pacemaker.
> Pacemaker cannot do much about "suboptimal".

I wonder what it would take to change that.  I suspect either a
crystal ball or way too much knowledge of drbd internals.

>
> Pacemaker can stop, start, and promote/demote resources.
> No more, no less.
>
> If your resources are running "suboptimal" (but working),
> stopping/restarting things, in the hope that would make them
> run better, likely won't add to your availability.
>
> Pacemaker is not a substitute for proper monitoring (nagios, whatever).
>
> Monitoring can page your engineer on duty (or yourself)
> for things that require immediate admin intervention.
> Monitoring can provide you with nice graphs, so you can detect early
> which things may require strategic admin intervention.
>
> It is not pacemaker's job to do either.
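
For what it's worth, a monitoring check for this does not need to be big.
Below is a rough sketch of a Nagios-style check, not a finished plugin. It
assumes DRBD 8.x style status lines as found in /proc/drbd (for example
" 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----") and
the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The
name check_drbd and the choice to treat anything other than
Connected + UpToDate/UpToDate as CRITICAL are my own assumptions.

```python
import re

# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches a DRBD 8.x style per-device status line (assumed format), e.g.
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
LINE = re.compile(r"^\s*(\d+): cs:(\S+)(?: ro:(\S+))?(?: ds:(\S+))?")


def check_drbd(status_text):
    """Return (exit_code, message) for the given /proc/drbd style text."""
    problems = []
    seen = 0
    for line in status_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # version banner, sync progress lines, etc.
        seen += 1
        minor, cs, _ro, ds = m.groups()
        # Assumption: only fully connected and UpToDate on both sides is "optimal".
        if cs != "Connected" or ds != "UpToDate/UpToDate":
            problems.append("minor %s: cs:%s ds:%s" % (minor, cs, ds))
    if seen == 0:
        return UNKNOWN, "UNKNOWN: no DRBD devices found"
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, "OK: %d device(s) connected and UpToDate" % seen
```

To use it, read /proc/drbd, pass the text to check_drbd(), print the
message and exit with the returned code.
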
>
>> Is it already there and I have made a configuration error? Or what is
>> the recommended way to check the sanity of the resources controlled by
>> pacemaker?
>
> Do you expect the cluster manager to sound the alarm beep as well,
> if a disk falls out of the raid, or the battery of the BBWC on the
> controller is depleted?
> Or if the response time of your home page goes bad (but the status
> page still comes back within the timeout)?
>
> What is Pacemaker expected to do?  Stop everything?
>
> If you are Primary on DRBD, and the lower level disk has some IO error,
> DRBD detaches from the local disk. The RA will notice this on the next
> monitoring interval, and adjust the master score accordingly.
> Depending on overall configuration, pacemaker may then decide to migrate
> the resource over to the other node, or not.
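
To make that concrete for myself, here is a toy model of the idea, not the
real RA and not Pacemaker's actual placement logic. The score values, the
function names and the single "stickiness" knob standing in for "overall
configuration" are all made up for illustration.

```python
# Hypothetical master scores per local disk state; a real RA picks its own.
SCORES = {"UpToDate": 1000, "Consistent": 100}


def master_score(disk_state):
    """Score a node advertises for promotion, based on its local disk state."""
    # Diskless, Failed, Inconsistent, Outdated, ... all get the lowest score here.
    return SCORES.get(disk_state, 0)


def preferred_master(disk_states, current_master, stickiness=0):
    """Pick the node to run as master.

    disk_states maps node name -> local disk state. The current master gets
    a stickiness bonus, a stand-in for whatever the cluster configuration
    says about how expensive it is to move the resource.
    """
    def total(node):
        bonus = stickiness if node == current_master else 0
        # Second tuple element breaks ties in favour of staying put.
        return (master_score(disk_states[node]) + bonus, node == current_master)

    return max(sorted(disk_states), key=total)
```

With {"a": "Diskless", "b": "UpToDate"} and "a" as current master, this
picks "b" when stickiness is 0 and stays on "a" when stickiness is 5000,
which is the "may then decide to migrate, or not" part.
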
>
> But for many other resource-internal problems,
> such as replication link damage,
> pacemaker has no way to magically heal things.
>
>
> But ok, for strictly "informational purposes", conceivably,
> we could add a monitoring result code to the RA spec saying
> "working [slave/master], but degraded".
>
> That could then be presented in some obvious way in crm_mon, or even
> trigger certain action scripts (which again could then page you).
>
> Currently, a similar effect could be achieved
> by adding some sort of "supervisor resource",
> which would need to be made dependent on the supervised resource,
> and would "fail" if the supervised resource is not running "optimally".
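
If someone wants to try the supervisor route, the monitor action could be
as small as the sketch below. It only assumes the standard OCF return codes
(0 success, 1 generic error, 7 not running). The function name and the
supervised_is_optimal callback are hypothetical; how you actually decide
"optimal" is up to you.

```python
# Standard OCF resource agent return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def supervisor_monitor(supervisor_started, supervised_is_optimal):
    """Monitor action of a hypothetical supervisor resource.

    supervisor_started: whether this supervisor was started on this node.
    supervised_is_optimal: callable returning True if the supervised
    resource is running in its optimal mode.
    """
    if not supervisor_started:
        # Cleanly stopped: not a failure, just not running here.
        return OCF_NOT_RUNNING
    if supervised_is_optimal():
        return OCF_SUCCESS
    # Degraded: report a failure so it shows up in crm_mon.
    return OCF_ERR_GENERIC
```

The failure then shows up in crm_mon without touching the supervised
resource itself, provided the constraints only go from the supervisor to
the supervised resource and not the other way around.
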
>
> My feeling is, don't try to do everything with the same tool.
> Use the best tool for the job.
> Use a monitoring tool for system monitoring.
> Use a cluster manager for cluster management.
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>