On 26.03.2011 00:10, Lars Ellenberg wrote:
...
>
> Yep, "degraded" is not a state available for pacemaker.
> Pacemaker cannot do much about "suboptimal".
>
> Pacemaker can stop, start, and promote/demote resources.
> No more, no less.
>
> If your resources are running "suboptimal" (but working),
> stopping/restarting things, in the hope that would make them
> run better, likely won't add to your availability.
>
> Pacemaker is not a substitute for proper monitoring (nagios, whatever).
>
> Monitoring can page your engineer on duty (or yourself)
> for things that require immediate admin intervention.
> Monitoring can provide you with nice graphs, so you can detect early
> which things may require strategic admin intervention.
>
> It is not pacemaker's job to do either.
>
>> Is it already there and I have made a configuration error? Or what is
>> the recommended way to check the sanity of the resources controlled by
>> pacemaker?
>
> Do you expect the cluster manager to sound the alarm beep as well,
> if a disk falls out of the raid, or the battery of the BBWC on the
> controller is depleted?
> Or if the response time of your home page goes bad (but the status
> page still comes back within the timeout)?
>
> What is Pacemaker expected to do?  Stop everything?
>
> If you are Primary on DRBD, and the lower level disk has some IO error,
> DRBD detaches from the local disk. The RA will notice this on the next
> monitoring interval, and adjust the master score accordingly.
> Depending on overall configuration, pacemaker may then decide to migrate
> the resource over to the other node, or not.
>
> But for many other resource-internal problems,
> such as replication link damage,
> pacemaker has no way to magically heal things.
>
>
> But ok, for strictly "informational purposes", conceivably,
> we could add a monitoring result code to the RA spec saying
> "working [slave/master], but degraded".
>
> That could then be presented in some obvious way in crm_mon, or even
> trigger certain action scripts (which again could then page you).
>
> Currently, a similar effect could be achieved
> by adding some sort of "supervisor resource",
> which would need to be made dependent on the supervised resource,
> and would "fail" if the supervised resource is not running "optimally".
>
> My feeling is, don't try to do everything with the same tool.
> Use the best tool for the job.
> Use a monitoring tool for system monitoring.
> Use a cluster manager for cluster management.
>

Thanks for your detailed response.

I now see that external monitoring has to be implemented in addition to 
the cluster management. Adding a supervisor resource sounds like a hack 
to me.
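
For the record, this is roughly how I understand the supervisor idea: a
minimal sketch, not a real resource agent. The exit code values are the
standard OCF ones; the function name and its two boolean inputs are
hypothetical and stand in for whatever check would actually decide that
the supervised resource is degraded.

```python
# Standard OCF exit codes returned by resource agent actions.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def supervisor_monitor(supervised_running, supervised_degraded):
    """Monitor result of a hypothetical "supervisor" resource.

    Both arguments are assumptions: a real agent would have to derive
    them from the supervised resource's own status output.
    """
    if not supervised_running:
        # Nothing to supervise, so the supervisor reports itself as stopped.
        return OCF_NOT_RUNNING
    if supervised_degraded:
        # Working but suboptimal: "fail" so the cluster notices.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

The supervisor "fails" only to make the degraded state visible; the
supervised resource itself is left alone.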

However, I think that a degraded resource often means that a future
promotion or migration will probably fail. And I think that this is
something that should be of interest to the cluster manager.
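
To illustrate what I mean, here is a small sketch under stated
assumptions. The state names are DRBD disk states, but the policy (only
"UpToDate" counts as a safe promotion target), the function name, and
the node names are my own assumptions, not what DRBD or Pacemaker
actually implement.

```python
# Assumed policy: only a node whose local disk is UpToDate is a safe
# promotion target. This is an illustration, not DRBD's real rule set.
SAFE_TO_PROMOTE = {"UpToDate"}


def promotion_candidates(disk_state_by_node):
    """Return the nodes whose reported disk state makes a promotion
    likely to succeed, in the order they were given."""
    return [
        node
        for node, disk_state in disk_state_by_node.items()
        if disk_state in SAFE_TO_PROMOTE
    ]


# Hypothetical two-node cluster where the standby replica is degraded.
cluster = {"alpha": "UpToDate", "beta": "Inconsistent"}
candidates = promotion_candidates(cluster)
```

With "beta" degraded, only "alpha" remains as a candidate, so a failover
away from "alpha" would have nowhere sane to go.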

Christoph
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
