24.11.2011 07:33, Andrew Beekhof wrote:
> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
> <bub...@hoster-ok.com> wrote:
>> Hi Andrew,
>>
>> I just found another problem with dlm_controld.pcmk (with your latest
>> patch from github applied and also my fixes to actually build it - they
>> are included in a message referenced by this one).
>> One node which just requested fencing of another one gets stuck printing
>> that message where you print ctime() in fence_node_time() (pacemaker.c
>> near 293) every second.
>
> So not blocked, it just keeps repeating that message?
> What date does it print?

Blocked... kern_stop
It prints the same date, one from not long ago (in that case). I caught it
only once and cannot reproduce it yet. The date is printed correctly under
"normal" fencing circumstances.

>
> Did you change it to the following?
>     log_debug("Node %d was last shot at: %s", nodeid,
> ctime(*last_fenced_time));

http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
contains the patches against 3.0.17 which I use. I only backported commits
to the dlm_controld core from 3.1.1 (and 3.1.7 in the last few days) to
bring it up to date (they are minor).

man ctime:
char *ctime(const time_t *timep);

int fence_node_time(int nodeid, uint64_t *last_fenced_time)
is called from check_fencing_done() with
uint64_t last_fenced_time;
rv = fence_node_time(node->nodeid, &last_fenced_time);

so I changed it to ctime(last_fenced_time).

btw, ctime adds a trailing newline, so it is a poor fit for logs.

One thought: maybe the latest commits to dlm.git (with membership
monitoring, notably e529211682418a8e33feafc9f703cff87e23aeba) would help
here?

And one note - I use fence_xvm for that failed VM, and I found that it is
a little bit deficient - only one instance of it can run on a host at a
time, as it binds to a predefined TCP port. Maybe that has an influence
as well...

>
>> No other messages appear, although
>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>> 444). So, both of (last_fenced_time >= node->fail_time) and
>> (!node->fence_queries || node->fence_time != last_fenced_time) are
>> false, otherwise one of the messages for those cases should be shown. Then,
>> fence_node_time() seems to return 0 from
>>         if (wait_count)
>>                 return 0;
>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>> false), so it never reaches the check_fencing_done() call and never returns
>> the expected 1.
>> The offending node was actually fenced, but that was not handled by
>> dlm_controld.
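
The wait logic described in the quoted paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the dlm_controld source: the struct layout, fake_fence_node_time() and check_fencing_done_sketch() are hypothetical names, and only the two conditions and the wait_count early return are taken from the message.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, trimmed-down node record. The field names follow the
 * ones quoted in the message; the layout itself is an assumption. */
struct node {
    int nodeid;
    uint64_t fail_time;    /* when the node was seen to fail */
    uint64_t fence_time;   /* fence time returned by the previous query */
    int fence_queries;     /* number of "wait" messages logged so far */
    int check_fencing;     /* non-zero while fencing is still awaited */
};

/* Stand-in for fence_node_time(): the real function asks the fencing
 * subsystem, this one only reads a caller-supplied table (assumption). */
static int fake_fence_node_time(const uint64_t *table, int nodeid,
                                uint64_t *last_fenced_time)
{
    *last_fenced_time = table[nodeid];
    return 0;
}

/* Returns 1 once every node awaiting fencing has a fence time at or
 * after its fail_time, and 0 while at least one node is still waiting. */
static int check_fencing_done_sketch(struct node *nodes, size_t n,
                                     const uint64_t *fence_table)
{
    int wait_count = 0;

    for (size_t i = 0; i < n; i++) {
        struct node *node = &nodes[i];
        uint64_t last_fenced_time = 0;

        if (!node->check_fencing)
            continue;

        fake_fence_node_time(fence_table, node->nodeid, &last_fenced_time);

        if (last_fenced_time >= node->fail_time) {
            /* Fenced after the failure: a "done" message would go here. */
            node->check_fencing = 0;
            node->fence_queries = 0;
            continue;
        }

        /* First query, or the reported time changed: a "wait" message
         * would go here. Otherwise the loop stays silent. */
        if (!node->fence_queries || node->fence_time != last_fenced_time) {
            node->fence_queries++;
            node->fence_time = last_fenced_time;
        }
        wait_count++;
    }

    if (wait_count)
        return 0;
    return 1;
}
```

In this sketch, a node whose reported fence time stays older than its fail_time keeps the function returning 0 on every call, and after the first query nothing further is logged, which resembles the silent once-per-second retry described above. Whether that is what happens in the real daemon is not established here.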
>>
>> May I ask you to help me a bit with all that logic (as you already dived
>> into dlm_controld sources again), I seem to be so near the success... :|
>>
>> btw, I can't find what source your dlm repo is forked from, maybe you
>> remember?
>
> iirc, it was dlm.git on fedorahosted.

Yep, I found that already, pacemaker branch. It seems to be a little bit
outdated compared to 3.0.17 btw.

>
>>
>> Best,
>> Vladislav
>>
>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>> Hi Andrew,
>>>
>>>>> All the more reason to start using the stonith api directly.
>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>
>>>>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>
>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>> my build. Then it doesn't compile without the attached patch.
>>>> It may need to be rebased a bit against your tree.
>>>>
>>>> Now I have the package built and am building node images. Will try shortly.
>>>
>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>> patch against that _no_mainloop function (expected).
>>>
>>> So I did my best to build packages from the current git tree.
>>>
>>> Voila! I got the failed node correctly fenced!
>>> I'll do some more extensive testing in the next days, but I believe
>>> everything should be much better now.
>>>
>>> I knew you're a genius he-he ;)
>>>
>>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>>> with pacemaker using the openais stack:
>>>
>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>> 3. Build dlm_controld.pcmk
>>>
>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>
>>> Best regards,
>>> Vladislav
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs:
>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
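
On the ctime() point raised earlier in the thread (a uint64_t timestamp handed to a function that expects a const time_t *, plus the trailing newline that ctime() appends): one possible way around both is to copy the value into a real time_t and format it with strftime(). The helper name format_fence_time() is hypothetical and is not part of dlm_controld; the use of UTC and the format string are arbitrary choices for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Formats a 64-bit epoch timestamp as "YYYY-MM-DD HH:MM:SS" in UTC,
 * without a trailing newline. Returns the number of characters written
 * (excluding the terminating NUL), or 0 if the buffer is too small or
 * the conversion fails. */
static size_t format_fence_time(uint64_t fenced, char *buf, size_t len)
{
    /* Copy into a real time_t instead of casting the pointer: time_t is
     * not guaranteed to have the same width as uint64_t. */
    time_t t = (time_t)fenced;
    struct tm tm;

    if (len == 0)
        return 0;
    if (gmtime_r(&t, &tm) == NULL) {
        buf[0] = '\0';
        return 0;
    }
    return strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm);
}
```

A caller would declare something like char when[32], call format_fence_time(last_fenced_time, when, sizeof when), and pass when to the log call in place of the ctime() result.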