On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg <[email protected]> wrote:
> On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote:
>> Hi,
>>
>> we experiment with DRBD and pacemaker and see several times that the
>> DRBD part is degraded (One node is outdated or diskless or something
>> similar) but crm_mon just reports that the DRBD resource runs as master
>> and slave on the nodes.
>>
>> There is no indication that the resource is not in its optimal mode of
>> operation.
>>
>> For me it seems as if pacemaker knows only the states: running, stopped,
>> failed.
>>
>> I am missing the state: running degraded or suboptimal.
>
> Yep, "degraded" is not a state available for pacemaker.
> Pacemaker cannot do much about "suboptimal".

I wonder what it would take to change that. I suspect either a crystal
ball or way too much knowledge of drbd internals.

>
> Pacemaker can stop, start, and promote/demote resources.
> No more, no less.
>
> If your resources are running "suboptimal" (but working),
> stopping/restarting things, in the hope that would make them
> run better, likely won't add to your availability.
>
> Pacemaker is not a substitute for proper monitoring (nagios, whatever).
>
> Monitoring can page your engineer on duty (or yourself)
> for things that require immediate admin intervention.
> Monitoring can provide you with nice graphs, so you can detect early
> which things may require strategic admin intervention.
>
> It is not pacemaker's job to do either.
>
>> Is it already there and I have made a configuration error? Or what is
>> the recommended way to check the sanity of the resources controlled by
>> pacemaker?
>
> Do you expect the cluster manager to sound the alarm beep as well,
> if a disk falls out of the raid, or the battery of the BBWC on the
> controller is depleted?
> Or if the response time of your home page goes bad (but the status
> page still comes back within the timeout)?
>
> What is Pacemaker expected to do? Stop everything?
>
> If you are Primary on DRBD, and the lower level disk has some IO error,
> DRBD detaches from the local disk. The RA will notice this on the next
> monitoring interval, and adjust the master score accordingly.
> Depending on overall configuration, pacemaker may then decide to migrate
> the resource over to the other node, or not.
>
> But for many other resource internal problems,
> replication link damage or something like that,
> pacemaker has no way to magically heal things.
>
>
> But ok, for strictly "informational purposes", conceivably,
> we could add a monitoring result code to the RA spec saying
> "working [slave/master], but degraded".
>
> That could then be presented in some obvious way in crm_mon, or even
> trigger certain action scripts (which again could then page you).
>
> Currently, a similar effect could be achieved
> by adding some sort of "supervisor resource",
> which would need to be made dependent on the supervised resource,
> and would "fail" if the supervised resource is not running "optimal".
>
> My feeling is, don't try to do everything with the same tool.
> Use the best tool for the job.
> Use a monitoring tool for system monitoring.
> Use a cluster manager for cluster management.
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
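
For illustration only, here is a rough sketch of the kind of check an
external monitoring tool (or the "supervisor resource" mentioned above)
might run. It is not taken from any existing plugin or resource agent.
It assumes DRBD 8.x-style output from "drbdadm cstate" and "drbdadm
dstate" (for example "Connected" and "UpToDate/UpToDate"), and it uses
the usual Nagios plugin exit codes. The function names classify and
check, and the resource name r0 in the note below, are made up for this
example.

```python
import subprocess

# Exit codes follow the common Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(cstate: str, dstate: str) -> int:
    """Map DRBD connection and disk state strings to a check result.

    Assumes dstate has the form "<local disk>/<peer disk>".
    """
    local_disk, _, peer_disk = dstate.strip().partition("/")
    if local_disk != "UpToDate":
        # Local disk is diskless, outdated, inconsistent, ...
        return CRITICAL
    if cstate.strip() != "Connected" or peer_disk != "UpToDate":
        # Still working locally, but replication is degraded.
        return WARNING
    return OK


def drbdadm(subcommand: str, resource: str) -> str:
    """Run 'drbdadm <subcommand> <resource>' and return its output."""
    return subprocess.check_output(["drbdadm", subcommand, resource], text=True)


def check(resource: str) -> int:
    """Query drbdadm for one resource and classify its state."""
    return classify(drbdadm("cstate", resource), drbdadm("dstate", resource))
```

A monitoring wrapper could then call sys.exit(check("r0")) and leave the
paging to nagios or whatever else is in use, while pacemaker keeps doing
start, stop, promote and demote.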
