24.11.2011 08:49, Andrew Beekhof wrote:
> On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
> <bub...@hoster-ok.com> wrote:
>> 24.11.2011 07:33, Andrew Beekhof wrote:
>>> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>>> <bub...@hoster-ok.com> wrote:
>>>> Hi Andrew,
>>>>
>>>> I just found another problem with dlm_controld.pcmk (with your latest
>>>> patch from github applied and also my fixes to actually build it - they
>>>> are included in a message referenced by this one).
>>>> One node which just requested fencing of another one gets stuck printing
>>>> that message where you print ctime() in fence_node_time() (pacemaker.c
>>>> near 293) every second.
>>>
>>> So not blocked, it just keeps repeating that message?
>>> What date does it print?
>>
>> Blocked... kern_stop
>
> I'm confused.

So am I...

> How can it do that every second?

Only in one case: if both (last_fenced_time >= node->fail_time) and
(!node->fence_queries || node->fence_time != last_fenced_time) are *false*.

So, three conditions are *true* at the same moment:
* last_fenced_time < node->fail_time
* node->fence_queries != 0
* node->fence_time == last_fenced_time

If all of those are true, check_fencing_done just silently returns 0.
In all other cases I'd see one of the messages "check_fencing %d done" or
"check_fencing %d wait" (the first one should stop that loop, btw) in between
consecutive "Node %d/%s was last shot at: %s" messages.

>
>>
>> It prints the same date, one from not long ago (in that case).
>> I caught it only once and cannot reproduce it yet. The date is printed
>> correctly in "normal" fencing circumstances.
>>
>>>
>>> Did you change it to the following?
>>> log_debug("Node %d was last shot at: %s", nodeid,
>>> ctime(*last_fenced_time));
>>
>> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
>> contains patches against 3.0.17 which I use. I only backported commits
>> to dlm_controld core from 3.1.1 (and 3.1.7 in the last days) to make it
>> up to date (they are minor).
>
> Ok, this (which was from my original patch) is wrong:
>
> + log_debug("Node %d/%s was last shot at: %s", nodeid,
> ctime(*last_fenced_time));

Agreed, and I use
log_debug("Node %d/%s was last shot at: %s", nodeid, node_uname,
ctime(last_fenced_time));
Please see the patches included in the message referenced above (a little
below the backport of your original patch).
gcc sometimes is smart enough ;)

>
> The format string expects 3 parameters but there are only 2 supplied.
> This could easily result in what you're seeing.

So, no, that's not it.
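
On the log call itself, here is a minimal sketch of one way to print the fence time without the trailing newline that ctime() adds. It assumes the value really is seconds since the epoch stored in a uint64_t. format_fence_time() is a hypothetical helper, not something that exists in dlm_controld or in the patches. It copies the value into a real time_t instead of handing a uint64_t pointer to ctime(), because the two types are not guaranteed to match.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of dlm_controld): formats a fence
 * timestamp into buf in a ctime()-like layout, without a trailing
 * newline. Returns buf so it can be used directly as a %s argument. */
const char *format_fence_time(uint64_t last_fenced_time, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* Copy into a real time_t: time_t need not have the same width or
     * signedness as uint64_t, so the pointer should not simply be cast. */
    time_t t = (time_t)last_fenced_time;

    /* localtime() is not reentrant; that is assumed to be acceptable
     * here because the sketch is single-threaded. */
    struct tm *tm = localtime(&t);

    if (tm == NULL || strftime(buf, len, "%a %b %d %H:%M:%S %Y", tm) == 0) {
        /* Conversion failed or the buffer was too small. */
        snprintf(buf, len, "(unknown)");
    }
    return buf;
}
```

With such a helper the call inside fence_node_time() could look like log_debug("Node %d/%s was last shot at: %s", nodeid, node_uname, format_fence_time(*last_fenced_time, buf, sizeof(buf))), with buf being a local char array of 32 bytes or so: three conversions, three arguments.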

>
>
>>
>> man ctime
>> char *ctime(const time_t *timep);
>>
>> int fence_node_time(int nodeid, uint64_t *last_fenced_time)
>> is called from check_fencing_done() with
>> uint64_t last_fenced_time;
>> rv = fence_node_time(node->nodeid, &last_fenced_time);
>> so, I changed it to ctime(last_fenced_time). btw ctime adds a trailing
>> newline, so it fits badly in logs.
>>
>> One thought: maybe the last commits to dlm.git (with membership monitoring,
>> notably e529211682418a8e33feafc9f703cff87e23aeba) may help here?
>>
>> And one note - I use fence_xvm for that failed VM, and I found that it
>> is a little bit deficient - only one instance of it can be run on a host
>> simultaneously as it binds to the predefined TCP port. Maybe that has an
>> influence as well...
>>
>>>
>>>> No other messages appear, although
>>>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>>>> 444). So, both of (last_fenced_time >= node->fail_time) and
>>>> (!node->fence_queries || node->fence_time != last_fenced_time) are
>>>> false, otherwise one of the messages for those cases should be shown. Then,
>>>> fence_node_time() seems to return 0 from
>>>> if (wait_count)
>>>> return 0;
>>>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>>>> false), so it never reaches check_fencing_done() call and never returns
>>>> the expected 1.
>>>> The offending node was actually fenced, but that was not handled by
>>>> dlm_controld.
>>>>
>>>> May I ask you to help me a bit with all that logic (as you already dived
>>>> into dlm_controld sources again), I seem to be so near the success... :|
>>>>
>>>> btw, I can't find what source your dlm repo is forked from, maybe you
>>>> remember?
>>>
>>> iirc, it was dlm.git on fedorahosted.
>>
>> Yep, I found that already, pacemaker branch. It seems to be a little bit
>> outdated compared to 3.0.17 btw.

>>
>>>
>>>>
>>>> Best,
>>>> Vladislav
>>>>
>>>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>>>> Hi Andrew,
>>>>>
>>>>>>> All the more reason to start using the stonith api directly.
>>>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>>>
>>>>>>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>>>
>>>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>>>> my build. Then it doesn't compile without the attached patch.
>>>>>> It may need to be rebased a bit against your tree.
>>>>>>
>>>>>> Now I have the package built and am building node images. Will try shortly.
>>>>>
>>>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>>>> patch against that _no_mainloop function (expected).
>>>>>
>>>>> So I did my best to build packages from the current git tree.
>>>>>
>>>>> Voila! I got the failed node correctly fenced!
>>>>> I'll do some more extensive testing in the next days, but I believe
>>>>> everything should be much better now.
>>>>>
>>>>> I knew you're a genius he-he ;)
>>>>>
>>>>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>>>>> with pacemaker using the openais stack:
>>>>>
>>>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>>>> 3. Build dlm_controld.pcmk
>>>>>
>>>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>>>
>>>>> Best regards,
>>>>> Vladislav
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs:
>>>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>>
>>>>
>>
>>