Hi all, this is my first post on this list. I used Heartbeat for a big project back in 2008, and now I'm back with Pacemaker for a smaller one.
I have two nodes running DRBD/cLVM/OCFS2 with KVM virtual machines, all on Debian Wheezy using testing (quite stable) packages. The configuration uses meatware STONITH and some colocation rules (if needed I can post the full CIB; a simplified sketch of the layout is at the end of this mail).

If I stop one of the two nodes gracefully everything works fine: the VM resources migrate to the other node, DRBD fencing kicks in, and all colocation and start/stop ordering constraints are fulfilled.

Bad things happen when I force a reset of one of the two nodes with

  echo b > /proc/sysrq-trigger

Scenario 1) The cluster software hangs completely: crm_mon still reports 2 nodes online, even though the other node has rebooted and is sitting there with corosync/pacemaker not running. No STONITH message at all.

Scenario 2) Sometimes I do see the meatware STONITH message; I run meatclient and then the cluster hangs.

Scenario 3) Meatware message, I run meatclient, crm_mon reports the node as "unclean", but I see some resources stopped and some running or Master.

With the full configuration including OCFS2 (I tested GFS2 too) I see these messages in syslog:

  kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120 seconds.
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh D ffff88041fc53540 0 11370 11368 0x00000000
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635] ffff88040b50ce60 0000000000000082 0000000000000000 ffff88040f235610
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642] 0000000000013540 ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648] 0000000000000246 0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673] [<ffffffffa06da2d9>] ? ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679] [<ffffffff8105f51b>] ? add_wait_queue+0x3c/0x3c
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696] [<ffffffffa06c8896>] ? ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714] [<ffffffffa06cdd2a>] ? ocfs2_permission+0x2b/0xe1 [ocfs2]
  Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721] [<ffffffff811019e9>] ? unlazy_walk+0x100/0x132

So, to simplify and rule OCFS2 out of the hang, I tried DRBD/cLVM only, but after resetting one node with the same echo b the cluster hangs with these messages in syslog:

  kernel: [ 8747.118110] INFO: task clvmd:8514 blocked for more than 120 seconds.
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118119] clvmd D ffff88043fc33540 0 8514 1 0x00000000
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118126] ffff8803e1b35810 0000000000000082 ffff880416efbd00 ffff88042f1f40c0
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118134] 0000000000013540 ffff8803e154bfd8 ffff8803e154bfd8 ffff8803e1b35810
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118140] ffffffff8127a5fe 0000000000000000 0000000000000000 ffff880411b8a698
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118147] Call Trace:
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118157] [<ffffffff8127a5fe>] ? sock_sendmsg+0xc1/0xde
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118165] [<ffffffff81349227>] ? rwsem_down_failed_common+0xe0/0x114
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118172] [<ffffffff811b1b64>] ? call_rwsem_down_read_failed+0x14/0x30
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118177] [<ffffffff81348bad>] ? down_read+0x17/0x19
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118195] [<ffffffffa0556a44>] ? dlm_user_request+0x3a/0x1a9 [dlm]
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118206] [<ffffffffa055e61b>] ? device_write+0x28b/0x616 [dlm]
  Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118214] [<ffffffff810eb4a9>] ? __kmalloc+0x100/0x112

It looks as if DLM or corosync stops talking, or never "senses" that the other node is gone, so everything stacked on top just sits there waiting.

Versions:
  corosync 1.4.2-2
  dlm-pcmk 3.0.12-3.1
  gfs-pcmk 3.0.12-3.1
  ocfs2-tools-pacemaker 1.6.4-1
  pacemaker 1.1.7-1

Any clue?
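For reference, the skeleton of the configuration is along these lines. This is only a trimmed sketch to show the layout; the resource names, devices and paths below are simplified placeholders, not the exact values from my CIB (which I can post in full if useful):

  # trimmed sketch -- names, devices and paths are placeholders
  primitive p_drbd_vm ocf:linbit:drbd params drbd_resource="vm" op monitor interval="30s"
  ms ms_drbd_vm p_drbd_vm meta master-max="2" clone-max="2" notify="true" interleave="true"
  primitive p_dlm ocf:pacemaker:controld op monitor interval="120s"
  primitive p_clvm ocf:lvm2:clvmd op monitor interval="120s"
  primitive p_o2cb ocf:pacemaker:o2cb op monitor interval="120s"
  primitive p_fs_vm ocf:heartbeat:Filesystem params device="/dev/vg_vm/lv_images" directory="/var/lib/libvirt/images" fstype="ocfs2"
  group g_storage p_dlm p_clvm p_o2cb p_fs_vm
  clone cl_storage g_storage meta interleave="true"
  primitive st_meat stonith:meatware params hostlist="hvlinux01 hvlinux02"
  clone cl_st_meat st_meat
  primitive p_vm_test ocf:heartbeat:VirtualDomain params config="/etc/libvirt/qemu/test.xml" op monitor interval="30s"
  colocation col_storage_on_drbd inf: cl_storage ms_drbd_vm:Master
  colocation col_vm_on_storage inf: p_vm_test cl_storage
  order o_drbd_before_storage inf: ms_drbd_vm:promote cl_storage:start
  order o_storage_before_vm inf: cl_storage p_vm_test
  property stonith-enabled="true" no-quorum-policy="ignore"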
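And for completeness, while the cluster is hung I can run checks along these lines on the surviving node to see whether corosync membership and the DLM lockspaces have actually noticed the dead peer (corosync 1.4 / dlm-pcmk era tools, so exact output will vary):

  # corosync ring status and membership as corosync sees it
  corosync-cfgtool -s
  corosync-objctl | grep member

  # pacemaker's view of nodes and resources
  crm_mon -1

  # DLM lockspaces (clvmd, plus ocfs2/gfs2 when the filesystem is mounted)
  dlm_tool ls

  # meatware has to be confirmed by hand once the node is really reset,
  # e.g. when hvlinux02 is the victim:
  meatclient -c hvlinux02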