24.03.2011 09:47, Andrew Beekhof wrote:
> On Wed, Mar 23, 2011 at 1:56 PM, Vladislav Bogdanov
> <bub...@hoster-ok.com> wrote:
>> Hi Andrew,
>>
>> 23.12.2010 14:14, Andrew Beekhof wrote:
>> ...
>>>> Especially I need to understand how pacemaker integrates with
>>>> cman's fencing/dlm subsystem:
>>>> *) Do I need to configure fencing in both cman and pacemaker?
>>>
>>> No. Just in Pacemaker.
>>> fenced spins waiting for Pacemaker to make an API call that tells it
>>> that fencing completed, at which point the dlm can continue.
>>
>> It doesn't seem to be enough even with c6a01b02950b:
>
> With just that patch, or everything before it too?
Everything.

>
>> When I killall -9 corosync on one node (vd01-b, cman id 2), which by
>> chance was the DC, then I get the following in the log on the
>> will-be-new-DC (vd01-d), which again by chance runs the stonith
>> resource for vd01-b (only relevant log lines):
>> ============
>> Mar 23 10:08:49 vd01-d corosync[1630]: [TOTEM ] A processor failed, forming new configuration.
>> Mar 23 10:09:01 vd01-d kernel: dlm: closing connection to node 2
>> Mar 23 10:09:01 vd01-d crmd: [1875]: info: cman_event_callback: Membership 1582268: quorum retained
>> Mar 23 10:09:01 vd01-d crmd: [1875]: info: ais_status_callback: status: vd01-b is now lost (was member)
>> Mar 23 10:09:01 vd01-d crmd: [1875]: info: crm_update_peer: Node vd01-b: id=2 state=lost (new) addr=(null) votes=0 born=1582212 seen=1582264 proc=00000000000000000000000000111312
>> Mar 23 10:09:01 vd01-d corosync[1630]: [CLM ] Members Left:
>> Mar 23 10:09:01 vd01-d crmd: [1875]: WARN: check_dead_member: Our DC node (vd01-b) left the cluster
>> Mar 23 10:09:01 vd01-d corosync[1630]: [CLM ] #011r(0) ip(10.5.4.65)
>> Mar 23 10:09:01 vd01-d crmd: [1875]: info: send_ais_text: Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)
>> Mar 23 10:09:01 vd01-d corosync[1630]: [QUORUM] Members[15]: 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16
>> Mar 23 10:09:02 vd01-d corosync[1630]: [MAIN ] Completed service synchronization, ready to provide service.
>> Mar 23 10:09:02 vd01-d fenced[1688]: fencing deferred to vd01-a
>> Mar 23 10:09:02 vd01-d crmd: [1875]: info: update_dc: Unset DC vd01-b
>> ============
>>
>> At this time fenced (on vd01-a, which has cman id 1 and is the
>> fencing domain master) tries to kill that node but fails:
>> ============
>> Mar 23 10:09:02 vd01-a fenced[1748]: fencing node vd01-b
>> Mar 23 10:09:02 vd01-a fenced[1748]: fence vd01-b dev 0.0 agent none result: error no method
>> Mar 23 10:09:02 vd01-a fenced[1748]: fence vd01-b failed
>> Mar 23 10:09:05 vd01-a fenced[1748]: fencing node vd01-b
>> Mar 23 10:09:05 vd01-a fenced[1748]: fence vd01-b dev 0.0 agent none result: error no method
>> Mar 23 10:09:05 vd01-a fenced[1748]: fence vd01-b failed
>> Mar 23 10:09:08 vd01-a fenced[1748]: fencing node vd01-b
>> Mar 23 10:09:08 vd01-a fenced[1748]: fence vd01-b dev 0.0 agent none result: error no method
>> Mar 23 10:09:08 vd01-a fenced[1748]: fence vd01-b failed
>> ============
>> All DLM-related stuff is blocked.
>>
>> After one minute vd01-d takes over the DC role:
>> ============
>> Mar 23 10:10:03 vd01-d crmd: [1875]: info: update_dc: Set DC to vd01-d (3.0.5)
>> ============
>> After that, all monitor operations on resources which depend on the
>> DLM (LVM, GFS) fail with a timeout, and all dependent resources are
>> then stopped, so the cluster stops being highly available.
>>
>> And only almost one more minute later does pacemaker decide to
>> stonith vd01-b:
>> ============
>> Mar 23 10:10:54 vd01-d crmd: [1875]: WARN: match_down_event: No match for shutdown action on vd01-b
>> Mar 23 10:10:54 vd01-d crmd: [1875]: info: te_update_diff: Stonith/shutdown of vd01-b not matched
>> Mar 23 10:10:55 vd01-d pengine: [1874]: WARN: pe_fence_node: Node vd01-b will be fenced because it is un-expectedly down
>> Mar 23 10:10:55 vd01-d pengine: [1874]: WARN: determine_online_status: Node vd01-b is unclean
>> ============
>>
>> and one minute later vd01-b is finally fenced.
>> ============
>> Mar 23 10:12:17 vd01-a crmd: [1935]: info: tengine_stonith_notify: Peer vd01-b was terminated (reboot) by vd01-d for vd01-d (ref=05cd139e-585d-452e-a22d-0ef188a64d81): OK
>> Mar 23 10:12:17 vd01-a crmd: [1935]: notice: tengine_stonith_notify: Notified CMAN that 'vd01-b' is now fenced
>> Mar 23 10:12:17 vd01-a crmd: [1935]: notice: tengine_stonith_notify: Confirmed CMAN fencing event for 'vd01-b'
>> Mar 23 10:12:17 vd01-a fenced[1748]: fence vd01-b overridden by administrator intervention
>> ============
>>
>> Overall it took (10:08:49 - 10:12:17) three and a half minutes to
>> fence the failed node.
>> So, for this kind of failure (a crash of corosync) it could be much
>> safer to duplicate fencing in both cman and pacemaker, because that
>> would take only 15-20 seconds to do the same. I'll check this a bit
>> later; I need to configure fencing in cman, and also check the case
>> where the fencing domain master fails.
>> An alternative could be for fenced to ask pacemaker to fence the
>> failed node (is it done this way?), but this will not help much if
>> the DC fails (my case), because the election of a new DC takes some
>> time too, and (I assume) pacemaker will refuse to do fencing without
>> a DC. And this time is enough for monitor ops to fail (yes, I can
>> configure bigger timeouts, but I generally want the cluster to be as
>> smart as possible).
>>
>> Would you please comment on this?
>>
>> Best,
>> Vladislav
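For reference, the "fence vd01-b dev 0.0 agent none result: error no
method" retries above happen because fenced has no fence devices of its
own. Rather than duplicating real devices in both stacks, cman's fencing
can be pointed at Pacemaker via the fence_pcmk agent, so fenced delegates
every request instead of looping on a missing method. A minimal
cluster.conf sketch follows; this assumes the fence_pcmk agent is
installed (it ships with cman-enabled Pacemaker builds), the cluster name
and config_version are placeholders, and only two of the sixteen nodes
from this thread are shown:
============
<?xml version="1.0"?>
<cluster name="vd01" config_version="1">
  <clusternodes>
    <!-- One entry per node; repeat for the remaining nodes. -->
    <clusternode name="vd01-a" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <!-- "port" is the node name fence_pcmk hands to Pacemaker -->
          <device name="pcmk" port="vd01-a"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vd01-b" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="vd01-b"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- fence_pcmk redirects fenced's requests to Pacemaker's stonith -->
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>
============
With something like this in place, the actual fencing devices stay
configured only in Pacemaker, consistent with the "just in Pacemaker"
answer above, while fenced gets a method to invoke instead of failing
and retrying until Pacemaker's own stonith finally confirms the kill.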