On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 24.11.2011 07:33, Andrew Beekhof wrote:
>> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>> <bub...@hoster-ok.com> wrote:
>>> Hi Andrew,
>>>
>>> I just found another problem with dlm_controld.pcmk (with your latest
>>> patch from github applied and also my fixes to actually build it - they
>>> are included in a message referenced by this one).
>>> One node which just requested fencing of another one gets stuck printing
>>> that message where you print ctime() in fence_node_time() (pacemaker.c
>>> near 293) every second.
>>
>> So not blocked, it just keeps repeating that message?
>> What date does it print?
>
> Blocked... kern_stop
I'm confused. How can it do that every second?

>
> It prints the same date, not long past (in that case).
> I caught it only once and cannot reproduce it yet. The date is printed
> correctly in "normal" fencing circumstances.
>
>>
>> Did you change it to the following?
>> log_debug("Node %d was last shot at: %s", nodeid,
>> ctime(*last_fenced_time));
>
> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
> contains patches against 3.0.17 which I use. I only backported commits
> to dlm_controld core from 3.1.1 (and 3.1.7 last days) to make it up2date
> (they are minor).

Ok, this (which was from my original patch) is wrong:

+ log_debug("Node %d/%s was last shot at: %s", nodeid,
  ctime(*last_fenced_time));

The format string has three conversion specifiers (%d, %s, %s) but only
two arguments are supplied.
This could easily result in what you're seeing.

>
> man ctime
> char *ctime(const time_t *timep);
>
> int fence_node_time(int nodeid, uint64_t *last_fenced_time)
> is called from check_fencing_done() with
> uint64_t last_fenced_time;
> rv = fence_node_time(node->nodeid, &last_fenced_time);
> so, I changed it to ctime(last_fenced_time). btw ctime adds a trailing
> newline, so it fits badly in logs.
>
> One thought: maybe the last commits to dlm.git (with membership monitoring,
> notably e529211682418a8e33feafc9f703cff87e23aeba) may help here?
>
> And one note - I use fence_xvm for that failed VM, and I found that it
> is a little bit deficient - only one instance of it can be run on a host
> simultaneously as it binds to the predefined TCP port. Maybe that has
> an influence as well...
>
>>
>>> No other messages appear, although
>>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>>> 444). So, both of (last_fenced_time >= node->fail_time) and
>>> (!node->fence_queries || node->fence_time != last_fenced_time) are
>>> false, otherwise one of the messages for those cases should be shown.
>>> Then, fence_node_time() seems to return 0 from
>>> if (wait_count)
>>> return 0;
>>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>>> false), so it never reaches check_fencing_done() call and never returns
>>> the expected 1.
>>> Offending node was actually fenced, but that was actually not handled by
>>> dlm_controld.
>>>
>>> May I ask you to help me a bit with all that logic (as you already dived
>>> into dlm_controld sources again), I seem to be so near the success... :|
>>>
>>> btw, I can't find what source your dlm repo is forked from, maybe you
>>> remember?
>>
>> iirc, it was dlm.git on fedorahosted.
>
> Yep, I found that already, pacemaker branch. It seems to be a little bit
> outdated compared to 3.0.17 btw.
>
>>
>>>
>>> Best,
>>> Vladislav
>>>
>>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>>> Hi Andrew,
>>>>
>>>>>> All the more reason to start using the stonith api directly.
>>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>>
>>>>>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>>
>>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>>> my build. Then it doesn't compile without attached patch.
>>>>> It may need to be rebased a bit against your tree.
>>>>>
>>>>> Now I have package built and am building node images. Will try shortly.
>>>>
>>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>>> patch against that _no_mainloop function (expected).
>>>>
>>>> So I did my best to build packages from the current git tree.
>>>>
>>>> Voila! I got the failed node correctly fenced!
>>>> I'll do some more extensive testing over the next few days, but I believe
>>>> everything should be much better now.
>>>>
>>>> I knew you're a genius he-he ;)
>>>>
>>>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>>>> with pacemaker using the openais stack:
>>>>
>>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>>> 3. Build dlm_controld.pcmk
>>>>
>>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>>
>>>> Best regards,
>>>> Vladislav
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs:
>>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
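
For reference, the three logging problems raised in this thread (a format string with more specifiers than arguments, a uint64_t pointer passed where ctime() expects a const time_t pointer, and the trailing newline ctime() appends) can be avoided together. The following is only a minimal sketch, assuming last_fenced_time holds seconds since the epoch. format_fence_time() and format_last_shot() are hypothetical helpers, not functions from dlm_controld, and snprintf() stands in for log_debug(). It also prints UTC via gmtime_r() rather than the local time ctime() would use, purely to keep the output deterministic.

```c
#define _POSIX_C_SOURCE 200809L  /* for gmtime_r() under strict -std modes */

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of dlm_controld): format a fence timestamp,
 * assumed to be seconds since the epoch, into buf without the trailing
 * newline that ctime() would add. Returns buf.
 */
const char *format_fence_time(uint64_t last_fenced_time, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* Copy into a real time_t instead of casting the uint64_t pointer:
     * the two types are not guaranteed to have the same size. */
    time_t t = (time_t)last_fenced_time;
    struct tm tm_buf;

    if (gmtime_r(&t, &tm_buf) == NULL ||
        strftime(buf, len, "%Y-%m-%d %H:%M:%S UTC", &tm_buf) == 0) {
        /* Fall back to the raw number if conversion or formatting fails. */
        snprintf(buf, len, "%" PRIu64, last_fenced_time);
    }
    return buf;
}

/*
 * Hypothetical helper: build the log line with exactly one argument per
 * conversion specifier (%d for nodeid, %s for the formatted time).
 * Returns what snprintf() returns.
 */
int format_last_shot(char *out, size_t outlen, int nodeid,
                     uint64_t last_fenced_time)
{
    char when[32];

    return snprintf(out, outlen, "Node %d was last shot at: %s", nodeid,
                    format_fence_time(last_fenced_time, when, sizeof(when)));
}
```

A caller would pass the uint64_t value itself (for example the variable filled in by fence_node_time()) and hand the resulting string to whatever logging call is in use. Compiling with -Wall (which enables -Wformat in gcc) reports a mismatch between specifiers and arguments at build time, provided the logging function carries a printf format attribute.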