Hi,

On Tue, Jun 15, 2010 at 02:25:51PM -0600, Dan Urist wrote:
> On Tue, 15 Jun 2010 22:08:37 +0200
> Dejan Muhamedagic <deja...@fastmail.fm> wrote:
>
> > Hi,
> >
> > On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> > > I've recently had exactly the same thing happen. One (highly
> > > kludgey!) solution I've considered is hacking a custom version of
> > > the stonith IPMI agent that would check whether the node was at all
> > > reachable following a stonith failure via any of the cluster
> > > interfaces reported by cl_status (I have redundant network links),
> > > and then return true (i.e. pretend the stonith succeeded) if it
> > > isn't. Since this is basically the logic I would use if I were
> > > trying to debug the issue remotely, I don't see that this would be
> > > any worse.
> > >
> > > Besides the obvious (potential, however unlikely, for split-brain),
> > > is there any reason this approach wouldn't work?
> >
> > That sounds like a reason good enough to me :) If you can't reach
> > the host, you cannot know its state.
>
> But in my case, if the live node can't reach the suspect node via its
> public network interface, its private bonded interface, or its IPMI
> card (I've added a ping test for that, to determine that it's actually
> unreachable rather than just failing), it seems pretty reasonable for
> me to assume it's really dead at that point.
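For what it's worth, the kind of wrapper Dan describes might look roughly
like the sketch below. It is untested and only illustrative: the addresses
are invented, and in practice they would have to come from the plugin
configuration or from cl_status rather than being hard-coded.

#!/bin/sh
# Sketch of a wrapper around the stock external/ipmi plugin: if the reset
# fails AND the node answers on none of the addresses we know about,
# pretend the stonith succeeded.
# NB: this deliberately trades safety for availability (split-brain risk).

REAL=/usr/lib64/stonith/plugins/external/ipmi
action="$1"
node="$2"

# public, bonded and IPMI addresses per node; illustration only, these
# would normally come from the plugin configuration or cl_status
addrs_for() {
    case "$1" in
        qpr3) echo "192.0.2.13 10.0.0.13 192.0.2.113" ;;
        qpr4) echo "192.0.2.14 10.0.0.14 192.0.2.114" ;;
    esac
}

case "$action" in
reset|off)
    "$REAL" "$action" "$node" && exit 0
    addrs=$(addrs_for "$node")
    [ -n "$addrs" ] || exit 1   # nothing to check: refuse to guess
    for a in $addrs; do
        # any answer means the node may still be alive: report failure
        ping -c 2 -w 3 "$a" >/dev/null 2>&1 && exit 1
    done
    # unreachable on every interface we can test: assume it is really dead
    exit 0
    ;;
*)
    # gethosts, status, getinfo-*, etc. go straight to the real plugin
    exec "$REAL" "$@"
    ;;
esac

Even with a check like that, though, "unreachable" is not the same as
"powered off".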
Perhaps somebody just pulled the network cables. I understand that
it's not unheard of.

Thanks,

Dejan

> > Thanks,
> >
> > Dejan
> >
> > > On Tue, 15 Jun 2010 19:55:49 +0200
> > > Bernd Schubert <bs_li...@aakef.fastmail.fm> wrote:
> > >
> > > > Hello Diane,
> > > >
> > > > the problem is that pacemaker is not allowed to take over
> > > > resources until stonith succeeds, as it simply does not know
> > > > about the state of the other server. Let's assume the other node
> > > > were still up and running, had a shared storage device mounted
> > > > and were writing to it, but no longer responded on the network.
> > > > If pacemaker now mounted this device again, you would get data
> > > > corruption. To protect you against that, it requires that
> > > > stonith succeeds, or that you manually solve that problem.
> > > >
> > > > The only automatic solution would be a more reliable stonith
> > > > device, e.g. IPMI with an extra power supply for the IPMI card or
> > > > a PDU.
> > > >
> > > > Cheers,
> > > > Bernd
> > > >
> > > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > > Thanks for the idea. Is there any way to automatically recover
> > > > > resources without manual intervention?
> > > > >
> > > > > Diane
> > > > >
> > > > > -----Original Message-----
> > > > > From: Bernd Schubert [mailto:bs_li...@aakef.fastmail.fm]
> > > > > Sent: Tuesday, June 15, 2010 1:39 PM
> > > > > To: pacemaker@oss.clusterlabs.org
> > > > > Cc: Schaefer, Diane E
> > > > > Subject: Re: [Pacemaker] abrupt power failure problem
> > > > >
> > > > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > > > Hi,
> > > > > > We are having trouble with our two-node cluster after one
> > > > > > node experiences an abrupt power failure. The resources do
> > > > > > not seem to start on the remaining node (i.e. DRBD resources
> > > > > > do not promote to master).
> > > > > > In the log we notice:
> > > > > >
> > > > > > Jan 8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd:
> > > > > > Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3'
> > > > > > returned 256
> > > > > > Jan 8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req:
> > > > > > 'ipmi reset' for host qpr3 failed with rc 256
> > > > > > Jan 8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH
> > > > > > node qpr3 with local device stonith0 (exitcode 5), gonna try
> > > > > > the next local device
> > > > > > Jan 8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage
> > > > > > qpr3, broadcast request to other nodes
> > > > > > Jan 8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH
> > > > > > the node qpr3: optype=RESET, op_result=TIMEOUT
> > > > > >
> > > > > > Jan 8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd:
> > > > > > Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3'
> > > > > > returned 256
> > > > > > Jan 8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req:
> > > > > > 'ipmi reset' for host qpr3 failed with rc 256
> > > > > > Jan 8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH
> > > > > > node qpr3 with local device stonith0 (exitcode 5), gonna try
> > > > > > the next local device
> > > > > > Jan 8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage
> > > > > > qpr3, broadcast request to other nodes
> > > > > > Jan 8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH
> > > > > > the node qpr3: optype=RESET, op_result=TIMEOUT
> > > > >
> > > > > Without looking at your hb_report, this already looks pretty
> > > > > clear - this node tries to reset the other node using IPMI and
> > > > > that fails, of course, as the node to be reset is powered off.
> > > > > When we had that problem in the past, we simply temporarily
> > > > > removed the failed node from the pacemaker configuration:
> > > > >
> > > > > crm node remove <node-name>
> > > > >
> > > > > Cheers,
> > > > > Bernd
>
> --
> Dan Urist
> dur...@ucar.edu
> 303-497-2459
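For anyone finding this thread later: the manual recovery Bernd describes
amounts to something like the following, using the node names from the logs
above (adding the node back once it has power again, and re-testing stonith,
is not shown here):

# on the surviving node (qpr4), only after making sure qpr3 really is powered off
crm node remove qpr3    # drop the dead node so stonith no longer blocks takeover
crm_mon -1              # resources should now start/promote on qpr4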
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker