On Mon, Sep 26, 2011 at 5:38 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> Hi Andrew,
>
> 26.09.2011 10:10, Andrew Beekhof wrote:
>> On Tue, Sep 6, 2011 at 5:27 PM, Vladislav Bogdanov <bub...@hoster-ok.com>
>> wrote:
>>> Hi Andrew, hi all,
>>>
>>> I'm further investigating dlm lockspace hangs I described in
>>> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
>>> and in the thread starting from
>>> https://lists.linux-foundation.org/pipermail/openais/2011-September/016701.html
>>> .
>>>
>>> What I described there is a setup which involves pacemaker-1.1.6 with
>>> corosync-1.4.1 and dlm_controld.pcmk from cluster-3.0.17 (without cman).
>>> I use the openais stack for pacemaker.
>>>
>>> I found that it is possible to reproduce the dlm kern_stop state across a
>>> whole cluster with iptables on just one node; it is sufficient to block
>>> all (or just corosync-specific) incoming/outgoing UDP for several
>>> seconds (that time probably depends on corosync settings). In my case I
>>> reproduced the hang with a 3-second traffic block:
>>> iptables -I INPUT 1 -p udp -j REJECT; \
>>> iptables -I OUTPUT 1 -p udp -j REJECT; \
>>> sleep 3; \
>>> iptables -D INPUT 1; \
>>> iptables -D OUTPUT 1
>>>
>>> I tried to make dlm_controld schedule fencing on the CPG_REASON_NODEDOWN
>>> event (just to see if it helps with the problems I described in the posts
>>> referenced above), but without much success; the following code does not work:
>>>
>>> int fd = pcmk_cluster_fd;
>>> int rc = crm_terminate_member_no_mainloop(nodeid, NULL, &fd);
>>>
>>> I get a "Could not kick node XXX from the cluster" message accompanied
>>> by "No connection to the cluster". That means that
>>> attrd_update_no_mainloop() fails.
>>>
>>> Andrew, could you please give some pointers on why it may fail? I'd then
>>> try to fix dlm_controld. I do not see any other uses of that function
>>> other than in dlm_controld.pcmk.
>>
>> I can't think of anything except that attrd might not be running. Is it?
>
> Will recheck.
>
>>
>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>
>> rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>
>> from fencing/admin.c
>>
>> That would talk directly to the fencing daemon, bypassing attrd, crmd
>> and PE - and thus be more reliable.
>>
>> This is what the cman plugin will be doing soon too.
>
> Great to know, I'll try that in the near future. Thank you very much for
> the pointer.
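
For readers following along, here is a rough sketch of the call shape quoted above (an operation table reached through st->cmds, with fence taking a handle, options, target, action and timeout). It does not link against Pacemaker: every type and function name below (mock_stonith_t, mock_fence, mock_stonith_init, request_reboot) is a hypothetical stand-in, the return codes are a made-up convention, and treating 120 as seconds is an assumption. The real definitions live in the Pacemaker fencing headers and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-ins, NOT the Pacemaker headers. They mirror only the
 * call shape quoted in the thread:
 *     rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
 */
typedef struct mock_stonith mock_stonith_t;

typedef struct mock_stonith_ops {
    /* Mock convention: 0 on success, negative on failure. */
    int (*fence)(mock_stonith_t *st, int call_options,
                 const char *target, const char *action, int timeout_s);
} mock_stonith_ops_t;

struct mock_stonith {
    const mock_stonith_ops_t *cmds; /* operation table, as in st->cmds */
    int connected;                  /* nonzero if the pretend daemon link is up */
    char last_target[64];           /* last request, recorded for inspection */
    char last_action[16];
    int last_timeout_s;
};

/* Mock fence operation: validates the request and records it. */
static int mock_fence(mock_stonith_t *st, int call_options,
                      const char *target, const char *action, int timeout_s)
{
    (void)call_options; /* unused in this mock */

    if (!st->connected) {
        return -1; /* no link to the (pretend) fencing daemon */
    }
    if (target == NULL || action == NULL || timeout_s <= 0) {
        return -2; /* malformed request */
    }
    snprintf(st->last_target, sizeof(st->last_target), "%s", target);
    snprintf(st->last_action, sizeof(st->last_action), "%s", action);
    st->last_timeout_s = timeout_s;
    return 0;
}

static const mock_stonith_ops_t mock_ops = { mock_fence };

/* Reset the mock handle and attach the operation table. */
void mock_stonith_init(mock_stonith_t *st, int connected)
{
    memset(st, 0, sizeof(*st));
    st->cmds = &mock_ops;
    st->connected = connected;
}

/* Ask for a reboot of `target` with timeout 120, as in the quoted call. */
int request_reboot(mock_stonith_t *st, const char *target)
{
    assert(st != NULL && st->cmds != NULL && st->cmds->fence != NULL);
    return st->cmds->fence(st, 0 /* st_opts */, target, "reboot", 120);
}
```

A caller would initialise the handle with mock_stonith_init(&st, 1) and then call request_reboot(&st, "node2"). The request goes straight to the operation table, with no attribute update in between, which is the reason given above for preferring this route.
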
1.1.7 will actually make use of this API regardless of any *_controld
changes - I'm in the middle of updating the two library functions they
use (crm_terminate_member and crm_terminate_member_no_mainloop).

>
>>
>>>
>>> I agree with Jiaju
>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html),
>>> that this could be solely a pacemaker problem, because it probably should
>>> originate fencing itself in such a situation, I think.
>>>
>>> So, using pacemaker/dlm with the openais stack is currently risky due to
>>> possible hangs of dlm_lockspaces.
>>
>> It shouldn't be, failing to connect to attrd is very unusual.
>
> By the way, one of the underlying problems, which actually made me notice
> all this, is that a pacemaker cluster does not fence its DC if it leaves
> the cluster for a very short time. That is what Jiaju said in his notes.
> And I can confirm that.

That's highly surprising. Do the logs you sent display this behaviour?

>
>>
>>> Originally I got it due to heavy load
>>> on one of the cluster nodes (actually on a host which has that cluster
>>> node running as a virtual guest).
>>>
>>> Ok, I switched to cman to see if it helps. Fencing is configured in
>>> pacemaker, not in cluster.conf.
>>>
>>> Things became even worse ;( .
>>>
>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>> understand, it is almost impossible to load a host that much, but
>>> anyway), I then got a real nightmare: two nodes of the 3-node cluster had
>>> cman stopped (and pacemaker too because of cman connection loss) - they
>>> asked to kick_node_from_cluster() for each other, and that succeeded.
>>> But fencing didn't happen (I still need to look into why, but this is
>>> cman specific).
>>> The remaining node had pacemaker hung; it doesn't even
>>> notice the cluster infrastructure change, down nodes were listed as
>>> online, one of them was a DC, all resources are marked as started on all
>>> (down too) nodes. No log entries from pacemaker at all.
>>
>> Well I can't see any logs from anyone so it's hard for me to comment.
>
> Logs are sent privately.
>
>>
>>> So, from my PoV cman+pacemaker is not currently suitable for HA tasks
>>> either.
>>>
>>> That means that both possible alternatives are currently unusable if one
>>> needs a self-repairing pacemaker cluster with dlm support ;( That is
>>> really regrettable.
>>>
>>> I can provide all needed information and really hope that it is possible
>>> to fix both issues:
>>> * dlm blockage with openais and
>>> * pacemaker lock with cman and no fencing from within dlm_controld
>>>
>>> I think both issues are really high priority, because it is definitely
>>> not acceptable when problems with load on one cluster node (or with the
>>> link to that node) lead to a total cluster lock or even crash.
>>>
>>> I also offer any possible assistance from my side (e.g. patch trials
>>> etc.) to get that all fixed. I can run either openais or cman and can
>>> quickly switch between those stacks.
>>>
>>> Sorry for not being brief,
>>>
>>> Best regards,
>>> Vladislav
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs:
>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker