"Andrew Beekhof" <[EMAIL PROTECTED]> wrote on 2008/05/23 06:24:35:
> On Thu, May 22, 2008 at 7:31 PM, <[EMAIL PROTECTED]> wrote:
> >
> > Thanks for the response.
> >
> > Yes, from the logs, in a 4 second interval while the nic was disconnected
> > lrmd attempts to start the STONITH resource twice.
> >
> > A couple of observations/questions :
> >
> > - Where is this behaviour/logic documented ?
>
> you mean the bit about start failures being fatal?
> hmmm - good question - I'm not sure we ever did document that, despite
> it being part of the design from day 1.
>
> > I've had great difficulty finding information on ha-linux. I eventually
> > stumbled across the pacemaker site but this just raises more questions
> > about what I should be running, i.e. if I'm looking at implementing a
> > drbd/heartbeat system in the next 3 months, should I stick with the
> > 2.1.3 heartbeat packages or go to pacemaker/heartbeat from the start?
>
> pacemaker/heartbeat from the start

I will look at downloading the binaries and upgrading.

> 2.1.3 was the last combined release before the CRM was split off to
> become Pacemaker.
> all CRM code has since been completely removed from the Heartbeat code-base.
>
> > - Why shouldn't we be able to change this behaviour with
> > on_fail="restart",
>
> basically, because "starts are considered special"
> however this is likely to become configurable now that failures
> (including failed starts) can be timed out
>
> > maybe also with the "interval" parameter ?
> > Wouldn't it be desirable to have the cluster recover without user
> > intervention ?
>
> "how?"
> we tried starting it everywhere, it failed.
> retrying forever, causing extra load and potentially downtime for
> other resources isn't the smartest thing to do.

By using on_fail="restart" and some reasonable interval value, would this
really put a load on the system ?

Another option for a stonith resource would be not to stop the resource if
the monitor/status failed - just write an error message and update the
Failed Actions.
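To make the question concrete, this is roughly what I had in mind - a sketch
of the stonith resource in the heartbeat 2.1.x CIB syntax. The ids and the
external/riloe type are just placeholders for our actual ilo setup :

    <clone id="CL_stonith_node02">
      <primitive id="stonith_node02" class="stonith" type="external/riloe">
        <operations>
          <!-- placeholder values; the idea is to retry a failed
               monitor rather than stop the resource for good -->
          <op id="stonith_node02_mon" name="monitor" interval="30s"
              timeout="60s" on_fail="restart"/>
        </operations>
      </primitive>
    </clone>

As I understand the reply above, a failed start would still be fatal
regardless of on_fail, so this would only change what happens on monitor
failures.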
The first successful monitor/status would clear the failed actions.

I connected to the pacemaker site and have read through some of the
documentation, including Initial Configuration and Pacemaker Configuration
Explained. Two questions :

- In Initial Configuration, under Enabling Pacemaker, you specify using
  "crm respawn" rather than "crm yes" if you plan to enable STONITH. Why ?
- In Pacemaker Configuration Explained, under "How Should the Configuration
  be Updated", you talk about using XML editors. Do you have any
  recommendations ?

Thanks again

> so instead we wait for the admin to clean up the problem and tell the
> cluster it's ok to continue.
>
> > In our case eth0 is connected to a switch which is only used for
> > connecting to the ilo cards - I would far rather have the cluster
> > recover than receive a call at 3am.
> >
> > - After some trial I found that to clean up the resource I needed to
> >   specify :
> >     crm_resource -C -H hatest1 -r CL_stonith_node02:0 and
> >     crm_resource -C -H hatest1 -r CL_stonith_node02:1
> >   shouldn't I just have to do : crm_resource -C -r CL_stonith_node02 ?
> > - from hb_gui however I can just click on CL_stonithset_node02 and
> >   select "Cleanup Resource", and it does work. The problem in the gui
> >   is that the "Failed actions" as shown by crm_mon are not present
> >   from the gui.
> > - also the man page for crm_resource has :
> >     --cleanup, -C
> >         Delete a resource from the LRM.
> >         Requires: -r. Optional: -H
>
> I believe that the documentation is incorrect.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
