On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> Hi Andrew,
>
> I just found another problem with dlm_controld.pcmk (with your latest
> patch from github applied and also my fixes to actually build it - they
> are included in a message referenced by this one).
> One node which just requested fencing of another one gets stuck printing
> that message where you print ctime() in fence_node_time() (pacemaker.c
> near 293) every second.
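Since the repeating message is built with ctime(), one detail is worth keeping in mind: ctime() takes a `const time_t *`, not a time_t value, and its result ends with a newline. The sketch below assumes the fencing timestamp is a `uint64_t` count of seconds since the epoch (an assumption, not confirmed by the thread) and that 0 means "never fenced". `format_fence_time()` is a hypothetical helper, not a dlm_controld function.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from dlm_controld): render a fencing timestamp
 * that is ASSUMED to be stored as a uint64_t count of seconds since the
 * epoch. ctime() expects a const time_t *, so the value is first copied
 * into a real time_t and the address of that copy is passed. */
static void format_fence_time(uint64_t last_fenced_time, char *buf, size_t len)
{
    time_t t = (time_t)last_fenced_time;
    const char *s;

    if (last_fenced_time == 0) {
        /* Assumption: 0 means no fencing has been recorded for the node. */
        snprintf(buf, len, "never");
        return;
    }

    s = ctime(&t); /* returns a static buffer; not thread-safe */
    if (s == NULL) {
        snprintf(buf, len, "invalid time");
        return;
    }

    /* ctime() output ends with '\n'; copy everything before it so the
     * log line stays on a single line. */
    snprintf(buf, len, "%.*s", (int)strcspn(s, "\n"), s);
}
```

A caller could then log the result with the thread's own format string, e.g. `log_debug("Node %d was last shot at: %s", nodeid, buf);`. Printing "never" for 0 makes it easy to tell a missing fencing record apart from a real but stale timestamp.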
So not blocked, it just keeps repeating that message?
What date does it print?

Did you change it to the following?

    log_debug("Node %d was last shot at: %s", nodeid, ctime(*last_fenced_time));

> No other messages appear, although
> fence_node_time() is called only from check_fencing_done() (cpg.c near
> 444). So, both of (last_fenced_time >= node->fail_time) and
> (!node->fence_queries || node->fence_time != last_fenced_time) are
> false, otherwise one of the messages for those cases should be shown. Then,
> fence_node_time() seems to return 0 from
>     if (wait_count)
>             return 0;
> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
> false), so it never reaches the check_fencing_done() call and never returns
> the expected 1.
> The offending node was actually fenced, but that was not handled by
> dlm_controld.
>
> May I ask you to help me a bit with all that logic (as you have already dived
> into the dlm_controld sources again)? I seem to be so close to success... :|
>
> btw, I can't find what source your dlm repo is forked from, maybe you
> remember?

iirc, it was dlm.git on fedorahosted.

>
> Best,
> Vladislav
>
> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>> Hi Andrew,
>>
>>>> All the more reason to start using the stonith api directly.
>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>
>>>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>
>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>> my build. Then it doesn't compile without the attached patch.
>>> It may need to be rebased a bit against your tree.
>>>
>>> Now I have the package built and am building node images. Will try shortly.
>>
>> Fencing from within dlm_controld.pcmk still did not work with your first
>> patch against that _no_mainloop function (expected).
>>
>> So I did my best to build packages from the current git tree.
>>
>> Voila! I got the failed node correctly fenced!
>>
>> I'll do some more extensive testing over the next few days, but I believe
>> everything should be much better now.
>>
>> I knew you're a genius he-he ;)
>>
>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>> with pacemaker using the openais stack:
>>
>> 1. Build pacemaker (as of 2011-09-28) from git.
>> 2. Apply the attached patches to the cluster-3.0.17 source tree.
>> 3. Build dlm_controld.pcmk.
>>
>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>
>> Best regards,
>> Vladislav
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
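The wait_count reasoning in the thread can be modelled in a few lines. This is a simplified, hypothetical sketch, not the dlm_controld source: `struct node_state` and `fencing_done()` are invented for illustration, the field names only mirror those quoted (`check_fencing`, `fail_time`, `last_fenced_time`), and the other conditions mentioned in the thread (such as the `fence_queries` / `fence_time` check) are deliberately left out. It shows why a node that really was fenced still blocks progress if the reported fencing time never reaches its failure time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node bookkeeping; NOT the dlm_controld struct. */
struct node_state {
    int nodeid;
    int check_fencing;         /* nonzero while proof of fencing is still needed */
    uint64_t fail_time;        /* when the node was seen to fail */
    uint64_t last_fenced_time; /* what the fencing query reported (0 = never) */
};

/* Returns 1 once every node under check has a fencing time at or after
 * its failure time, 0 otherwise. The early "if (wait_count) return 0;"
 * mirrors the fragment quoted in the thread: a single stale or missing
 * fencing time keeps the whole check returning 0 on every retry. */
static int fencing_done(struct node_state *nodes, size_t n)
{
    int wait_count = 0;

    for (size_t i = 0; i < n; i++) {
        struct node_state *node = &nodes[i];

        if (!node->check_fencing)
            continue;

        if (node->last_fenced_time >= node->fail_time) {
            /* Fencing confirmed: stop checking this node. */
            node->check_fencing = 0;
            node->fail_time = 0;
        } else {
            /* Still waiting for evidence that this node was fenced. */
            wait_count++;
        }
    }

    if (wait_count)
        return 0;

    return 1;
}
```

In this model, if the fencing layer never updates `last_fenced_time` (it stays 0 or below `fail_time`), `fencing_done()` returns 0 indefinitely, which matches the reported symptom of the node being fenced while dlm_controld keeps waiting.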