On 26.03.2011 00:10, Lars Ellenberg wrote:
...
>
> Yep, "degraded" is not a state available for pacemaker.
> Pacemaker cannot do much about "suboptimal".
>
> Pacemaker can stop, start, and promote/demote resources.
> No more, no less.
>
> If your resources are running "suboptimal" (but working),
> stopping/restarting things, in the hope that would make them
> run better, likely won't add to your availability.
>
> Pacemaker is not a substitute for proper monitoring (nagios, whatever).
>
> Monitoring can page your engineer on duty (or yourself)
> for things that require immediate admin intervention.
> Monitoring can provide you with nice graphs, so you can detect early
> which things may require strategic admin intervention.
>
> It is not pacemaker's job to do either.
>
>> Is it already there and I have made a configuration error? Or what is
>> the recommended way to check the sanity of the resources controlled by
>> pacemaker?
>
> Do you expect the cluster manager to sound the alarm beep as well,
> if a disk falls out of the raid, or the battery of the BBWC on the
> controller is depleted?
> Or if the response time of your home page goes bad (but the status
> page still comes back within the timeout)?
>
> What is Pacemaker expected to do? Stop everything?
>
> If you are Primary on DRBD, and the lower level disk has some IO error,
> DRBD detaches from the local disk. The RA will notice this on the next
> monitoring interval, and adjust the master score accordingly.
> Depending on overall configuration, pacemaker may then decide to migrate
> the resource over to the other node, or not.
>
> But for many other resource internal problems,
> replication link damage or something like that,
> pacemaker has no way to magically heal things.
>
>
> But ok, for strictly "informational purposes", conceivably,
> we could add a monitoring result code to the RA spec saying
> "working [slave/master], but degraded".
>
> That could then be presented in some obvious way in crm_mon, or even
> trigger certain action scripts (which again could then page you).
>
> Currently, a similar effect could be achieved
> by adding some sort of "supervisor resource",
> which would need to be made dependent on the supervised resource,
> and would "fail" if the supervised resource is not running "optimal".
>
> My feeling is, don't try to do everything with the same tool.
> Use the best tool for the job.
> Use a monitoring tool for system monitoring.
> Use a cluster manager for cluster management.
>

Thanks for your detailed response. I now see that external monitoring has
to be implemented in addition to the cluster management. Adding a
supervisor resource sounds like a hack to me.

However, I think that a degraded resource often means that a future
promotion or migration will probably fail, and that is something the
cluster manager should be interested in.

Christoph
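
For reference, the "supervisor resource" idea quoted above could look roughly like the following. This is only a sketch of the monitor logic, not an actual resource agent: the function name `supervisor_monitor` and the keys of the `status` dict (`running`, `disk`, `peer_connected`) are made up for illustration, and a real agent would have to obtain that information from the supervised resource itself. Only the numeric OCF exit codes are taken from the OCF resource agent API.

```python
# OCF exit codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def supervisor_monitor(status):
    """Monitor action of a hypothetical supervisor resource.

    `status` describes the supervised resource. Its keys ("running",
    "disk", "peer_connected") are assumptions made for this sketch,
    not the output of any real tool.
    """
    if not status.get("running", False):
        # The supervised resource is down, so there is nothing to supervise.
        return OCF_NOT_RUNNING

    degraded = (
        status.get("disk") != "UpToDate"
        or not status.get("peer_connected", False)
    )
    # "Working, but degraded" is reported as a failure of the supervisor
    # only; the supervised resource itself keeps running untouched.
    return OCF_ERR_GENERIC if degraded else OCF_SUCCESS
```

With something along these lines, a degraded state would show up as a failed monitor of the supervisor in crm_mon, while the supervised resource keeps reporting success. Whether the cluster then acts on that failure depends entirely on how the constraints are configured.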
