On Mon, Dec 19, 2011 at 11:11 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 19.12.2011 14:39, Vladislav Bogdanov wrote:
>> 09.12.2011 08:44, Andrew Beekhof wrote:
>>> On Fri, Dec 9, 2011 at 3:16 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>>> 09.12.2011 03:11, Andrew Beekhof wrote:
>>>>> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> I investigated on my test cluster what actually happens with dlm and fencing.
>>>>>>
>>>>>> I added more debug messages to dlm dump, and also did a re-kick of nodes after some time.
>>>>>>
>>>>>> The result is that the stonith history actually doesn't contain any information until pacemaker decides to fence the node itself.
>>>>>
>>>>> ...
>>>>>
>>>>>> From my PoV that means the call to crm_terminate_member_no_mainloop() does not actually schedule a fencing operation.
>>>>>
>>>>> You're going to have to remind me... what does your copy of crm_terminate_member_no_mainloop() look like?
>>>>> This is with the non-cman editions of the controlds too, right?
>>>>
>>>> Just the latest github version. You changed some dlm_controld.pcmk functionality so that it asks stonithd for fencing results instead of using XML magic, but the call to crm_terminate_member_no_mainloop() remains the same there. And yes, that version communicates with stonithd directly too.
>>>>
>>>> So, the problem here is just with crm_terminate_member_no_mainloop(), which for some reason skips the actual fencing request.
>>>
>>> There should be some logs, either indicating that it tried, or that it failed.
>>
>> Nothing about fencing.
>> Only messages about history requests:
>>
>> stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0
>>
>> I even moved all the fencing code into dlm_controld to have better control over what it does (and to avoid rebuilding pacemaker to play with that code).
>> dlm_tool dump prints the same line every second, and stonith-ng prints history requests.
>>
>> A little bit odd, but I once saw a fencing request from cluster-dlm succeed - though only right after the node had been fenced by pacemaker. As a result, the node was switched off instead of rebooted.
>>
>> That raises one more question: is it correct to call st->cmds->fence() with the third parameter set to "off"?
>> I think "reboot" is more consistent with the rest of the fencing subsystem.
>>
>> At the same time, stonith_admin -B succeeds.
>> The main difference I see is st_opt_sync_call in the latter case.
>> Will try to experiment with it.
>
> Yeeeesssss!!!
>
> Now I see the following:
> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info: pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
So the important question... what did you change?

> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: initiate_remote_stonith_op: Initiating remote operation reboot for vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-c now has id: 1107559690
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-c: rc=0
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-d now has id: 1124336906
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-d: rc=0
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-a: rc=0
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith: Requesting that vd01-c perform op reboot vd01-b
> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-b now has id: 1090782474
> ...
> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0
> Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer vd01-b was terminated (reboot) by vd01-c for vd01-a (ref=21425fc0-4311-40fa-9647-525c3f258471): OK
>
> But then I see a minor issue: the node is marked to be fenced again:
> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b will be fenced because it is un-expectedly down

Do you have logs for that?
tengine_stonith_notify() got called, which should have been enough to get the node cleaned up in the cib.

> ...
> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node vd01-b for STONITH
> ...
> Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing reboot fencing operation (249) on vd01-b (timeout=60000)
> ...
> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith: Requesting that vd01-c perform op reboot vd01-b
>
> And so on.
>
> I can't investigate this one in more depth, because I use fence_xvm in this testing cluster, and it has issues when running more than one stonith resource on a node. Also, my RA (in the cluster where this testing cluster runs) undefines the VM after a failure, so fence_xvm does not see the fencing victim in qpid and is unable to fence it again.
>
> Maybe it is possible to check whether the node was just fenced and skip the redundant fencing?

If the callbacks are being used correctly, it shouldn't be required

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
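[Editor's note] The fix Vladislav describes - requesting the fence synchronously with st_opt_sync_call and asking for "reboot" rather than "off", as stonith_admin -B does - can be sketched against the pacemaker 1.1-era libstonithd client API roughly as below. This is an illustrative sketch only, not the actual cluster-dlm patch; the function name pacemaker_terminate_member_sync is hypothetical, and the exact fence()/connect() signatures vary between pacemaker releases, so check crm/stonith-ng.h for the version in use.

```c
/* Sketch, assuming the pacemaker 1.1-era stonith client API:
 * stonith_api_new(), st->cmds->connect(), st->cmds->fence().
 * Not the actual cluster-dlm code from this thread. */
#include <crm/stonith-ng.h>

static int
pacemaker_terminate_member_sync(const char *node)   /* hypothetical name */
{
    stonith_t *st = stonith_api_new();
    int rc = st->cmds->connect(st, "cluster-dlm", NULL);

    if (rc == stonith_ok) {
        /* st_opt_sync_call blocks until the fencing operation completes,
         * which is the difference observed between the failing in-process
         * fence() call and the working `stonith_admin -B <node>`.
         * "reboot" (not "off") keeps the action consistent with the rest
         * of the fencing subsystem, so a successful fence leaves the node
         * rebooting rather than powered off. */
        rc = st->cmds->fence(st, st_opt_sync_call, node, "reboot",
                             120 /* timeout in seconds */);
    }

    st->cmds->disconnect(st);
    stonith_api_delete(st);
    return rc;
}
```

The command-line equivalent discussed above is `stonith_admin -B vd01-b` (reboot), as opposed to `stonith_admin -F vd01-b` (power off); both issue their request as a synchronous call.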