On Fri, 2012-12-14 at 08:39 +0100, Emmanuel Saint-Joanis wrote:
> 2012/12/14 Muhammad Sharfuddin <[email protected]>
> > node1(ailprd1) IP:192.168.7.11
> > node2(ailprd2) IP:192.168.7.12
> >
> > It's a two-node active/passive cluster that had been running perfectly
> > for the last two months, but yesterday both nodes were fenced (rebooted).
> > Network connectivity between the two nodes is perfect, and the cluster
> > is running fine again.
> >
> > Please help me understand the reason behind the following situation,
> > and how I can avoid it happening next time:
> >
> > on node1 (active node):
> > Dec 13 12:31:06 ailprd1 corosync[7274]: [TOTEM ] A processor failed, forming new configuration.
> > Dec 13 12:31:12 ailprd1 corosync[7274]: [CLM ] CLM CONFIGURATION CHANGE
> > Dec 13 12:31:12 ailprd1 corosync[7274]: [CLM ] New Configuration:
> > Dec 13 12:31:13 ailprd1 corosync[7274]: [CLM ] r(0) ip(192.168.7.11)
> > Dec 13 12:31:13 ailprd1 corosync[7274]: [CLM ] Members Left:
> > Dec 13 12:31:13 ailprd1 corosync[7274]: [CLM ] r(0) ip(192.168.7.12)
> >
> > on node2 (passive node):
> > Dec 13 12:31:05 ailprd2 corosync[7021]: [TOTEM ] A processor failed, forming new configuration.
> > Dec 13 12:31:11 ailprd2 corosync[7021]: [CLM ] CLM CONFIGURATION CHANGE
> > Dec 13 12:31:11 ailprd2 corosync[7021]: [CLM ] New Configuration:
> > Dec 13 12:31:11 ailprd2 corosync[7021]: [CLM ] r(0) ip(192.168.7.12)
> > Dec 13 12:31:11 ailprd2 corosync[7021]: [CLM ] Members Left:
> > Dec 13 12:31:11 ailprd2 corosync[7021]: [CLM ] r(0) ip(192.168.7.11)
> >
> > For node1 (ailprd1), node2 left; likewise, node2 (ailprd2) thinks that
> > node1 left. Node2 then tried to start the resources that were already
> > running on node1, and both nodes were fenced.
> >
> > corosync.conf:
> >     totem {
> >         rrp_mode: none
> >         join: 60
> >         max_messages: 20
> >         vsftype: none
> >         consensus: 6000
> >         secauth: off
> >         token_retransmits_before_loss_const: 10
> >         token: 5000
> >         version: 2
> >
> >         interface {
> >             bindnetaddr: 192.168.7.0
> >             mcastaddr: 224.0.0.116
> >             mcastport: 51234
> >             ringnumber: 0
> >         }
> >         clear_node_high_bit: yes
> >     .../...
>
> What's the Corosync version? 2.0, I guess.
> Maybe try this on each node:
>     tcpdump -i eth0 -envv "port 51234"
> to see whether the traffic gets through.
> What does the following report?
>     corosync-objctl | grep member    (if on v1)
>     corosync-cmapctl | grep member   (if on v2)
>
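The totem timeouts in the quoted config are linked to each other. Below is a minimal Python sketch of that arithmetic, assuming the corosync 1.x rules described in corosync.conf(5): consensus must be at least 1.2 * token, and when token_retransmits_before_loss_const is set, the token retransmit interval is derived as token / (token_retransmits_before_loss_const + 0.2). The TotemTimings class is hypothetical and not part of any corosync tooling.

```python
# Sketch: derive corosync 1.x totem timings from the posted corosync.conf values.
# Assumption: the corosync.conf(5) rules for 1.x, i.e.
#   token_retransmit = token / (token_retransmits_before_loss_const + 0.2)
#   consensus must be >= 1.2 * token
from dataclasses import dataclass


@dataclass
class TotemTimings:  # hypothetical helper, for illustration only
    token_ms: int      # "token:" in corosync.conf
    consensus_ms: int  # "consensus:" in corosync.conf
    retransmits: int   # "token_retransmits_before_loss_const:"

    def token_retransmit_ms(self) -> int:
        # Truncated to whole milliseconds.
        return int(self.token_ms / (self.retransmits + 0.2))

    def min_consensus_ms(self) -> int:
        # 1.2 * token, computed with integers to avoid float rounding.
        return self.token_ms * 6 // 5

    def consensus_ok(self) -> bool:
        return self.consensus_ms >= self.min_consensus_ms()


# Values from the quoted totem section.
posted = TotemTimings(token_ms=5000, consensus_ms=6000, retransmits=10)
print(posted.token_retransmit_ms())  # 490
print(posted.min_consensus_ms())     # 6000
print(posted.consensus_ok())         # True
```

Under these assumptions, a node retransmits the token roughly every 490 ms before declaring it lost after 5 s, and the posted consensus of 6000 ms sits exactly at the documented minimum.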
ailprd1:~/Desktop # corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.185051328.ip=r(0) ip(192.168.7.11)
runtime.totem.pg.mrp.srp.members.185051328.join_count=1
runtime.totem.pg.mrp.srp.members.185051328.status=joined
runtime.totem.pg.mrp.srp.members.201828544.ip=r(0) ip(192.168.7.12)
runtime.totem.pg.mrp.srp.members.201828544.join_count=1
runtime.totem.pg.mrp.srp.members.201828544.status=joined

Also:

ailprd1:~/Desktop # tcpdump -i bond0 -envv "port 51234"
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 96 bytes
15:07:33.117378 00:10:18:9a:1e:7c > 01:00:5e:00:00:74, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 110)
    192.168.7.11.51233 > 224.0.0.116.51234: UDP, length 82
15:07:33.299420 00:10:18:9a:1e:7c > 00:10:18:9a:21:c8, ethertype IPv4 (0x0800), length 112: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 98)
    192.168.7.11.51233 > 192.168.7.12.51234: UDP, length 70
15:07:33.299501 00:10:18:9a:21:c8 > 00:10:18:9a:1e:7c, ethertype IPv4 (0x0800), length 112: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 98)
    192.168.7.12.51233 > 192.168.7.11.51234: UDP, length 70
15:07:33.508558 00:10:18:9a:1e:7c > 01:00:5e:00:00:74, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 110)
    192.168.7.11.51233 > 224.0.0.116.51234: UDP, length 82
15:07:33.690607 00:10:18:9a:1e:7c > 00:10:18:9a:21:c8, ethertype IPv4 (0x0800), length 112: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 98)
    192.168.7.11.51233 > 192.168.7.12.51234: UDP, length 70
.
.
.
15:07:56.768994 00:10:18:9a:21:c8 > 00:10:18:9a:1e:7c, ethertype IPv4 (0x0800), length 112: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 98)
    192.168.7.12.51233 > 192.168.7.11.51234: UDP, length 70
^C
183 packets captured
183 packets received by filter
0 packets dropped by kernel

--
Regards,
Muhammad Sharfuddin
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
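The member keys in the corosync-objctl output above (185051328 and 201828544) are node IDs, not counters. Below is a minimal Python sketch, assuming that corosync 1.x, when no nodeid is configured, derives the ID from the ring 0 IPv4 address read as a 32-bit integer in x86 (little-endian) byte order, and that clear_node_high_bit forces bit 31 to zero. The function name auto_nodeid is hypothetical. Under that assumption it reproduces both IDs, which suggests each node is listed once, under its expected address, with status=joined.

```python
# Sketch: reproduce the member IDs printed by corosync-objctl from the node IPs.
# Assumption: corosync 1.x auto-generates the nodeid from the ring 0 IPv4
# address interpreted in little-endian (x86 host) byte order, and
# clear_node_high_bit clears bit 31 so the ID is a positive signed 32-bit int.
import socket


def auto_nodeid(ipv4: str, clear_high_bit: bool = True) -> int:  # hypothetical helper
    raw = socket.inet_aton(ipv4)            # 4 bytes in network byte order
    nodeid = int.from_bytes(raw, "little")  # reinterpret as a little-endian int
    if clear_high_bit:
        nodeid &= 0x7FFFFFFF                # keep only the low 31 bits
    return nodeid


for ip in ("192.168.7.11", "192.168.7.12"):
    print(ip, auto_nodeid(ip))
# 192.168.7.11 185051328
# 192.168.7.12 201828544
```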
