Try separating the mcastport values by 2 instead of 1 (there is a sketch of
what I mean at the bottom of this mail).

Regards,
-steve

On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
> Hi there,
>
> As per the recommendations, the 2-node clusters I have built use 2
> redundant rings for added resilience. I have been carrying out some
> testing on the clusters to ensure that a failure in one of the redundant
> rings can be recovered from. I am aware that corosync does not currently
> have a feature that monitors failed rings and brings them back up
> automatically once communications are repaired. All I have been doing is
> testing that the corosync-cfgtool -r command does what it says on the
> tin: "Reset redundant ring state cluster wide after a fault, to
> re-enable redundant ring operation."
>
> In my 2-node cluster I have been issuing the ifdown command on eth1 on
> node1. This results in corosync-cfgtool -s reporting the following:
>
> r...@mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>         id     = 172.59.60.4
>         status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> RING ID 1
>         id     = 172.23.42.37
>         status = ring 1 active with no faults
>
> I then issue ifup eth1 on node1 and check that I can now ping node2. The
> link is definitely up, so I then issue the command corosync-cfgtool -r.
> I then run corosync-cfgtool -s again and it reports:
>
> r...@mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>         id     = 172.59.60.4
>         status = ring 0 active with no faults
> RING ID 1
>         id     = 172.23.42.37
>         status = ring 1 active with no faults
>
> So things are looking good at this point, but if I wait 10 more seconds
> and run corosync-cfgtool -s again, it reports that ring 0 is FAULTY
> again:
>
> r...@mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>         id     = 172.59.60.4
>         status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> RING ID 1
>         id     = 172.23.42.37
>         status = ring 1 active with no faults
>
> It does not matter how many times I run corosync-cfgtool -r: ring 0 is
> reported as FAULTY again 10 seconds after the reset. I have tried running
> /etc/init.d/network restart on node1 in the hope that a full network stop
> and start makes a difference, but it doesn't. The only thing that fixes
> the situation is completely stopping and restarting the corosync cluster
> stack on both nodes (/etc/init.d/corosync stop and /etc/init.d/corosync
> start). Once I've done that, both rings stay up and are stable. This is
> obviously not what we want.
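
For anyone who wants to repeat the test, this is roughly the sequence
described above as a script. It is only a sketch: it assumes ring 0 runs
over eth1 on the node you are testing from, that 172.59.60.3 is the other
node's ring 0 address (as in the logs further down), and that your
distribution provides the ifdown/ifup and init scripts used in this mail.

    #!/bin/sh
    # Sketch of the link-failure / ring-reset test described above.
    PEER=172.59.60.3             # ring 0 address of the other node - adjust for your cluster

    ifdown eth1                  # break ring 0 on this node
    sleep 15                     # give totem time to mark the ring FAULTY
    corosync-cfgtool -s          # expect: RING ID 0 ... FAULTY

    ifup eth1                    # repair the link
    ping -c 3 "$PEER" || exit 1  # make sure the peer is reachable again
    corosync-cfgtool -r          # reset redundant ring state cluster wide
    sleep 15                     # wait longer than the ~10 seconds Tom mentions
    corosync-cfgtool -s          # ring 0 should stay "active with no faults"

On Tom's setup the last check is where things go wrong: ring 0 flips back
to FAULTY instead of staying active.
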
>
> I am running the latest RHEL rpms from here:
> http://www.clusterlabs.org/rpm/epel-5/x86_64/
>
> corosync-1.2.1-1.el5
> corosynclib-1.2.1-1.el5
> pacemaker-1.0.8-4.el5
> pacemaker-libs-1.0.8-4.el5
>
> My corosync.conf looks like this:
>
> compatibility: whitetank
>
> totem {
>         version: 2
>         secauth: off
>         threads: 0
>         consensus: 1201
>         rrp_mode: passive
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 172.59.60.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 4010
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 172.23.40.0
>                 mcastaddr: 226.94.2.1
>                 mcastport: 4011
>         }
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         logfile: /tmp/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>         }
> }
>
> amf {
>         mode: disabled
> }
>
> service {
>         # Load the Pacemaker Cluster Resource Manager
>         name: pacemaker
>         ver: 0
> }
>
> aisexec {
>         user: root
>         group: root
> }
>
> This is what gets written into /tmp/corosync.log when I carry out the
> link failure test and then try and reset the ring status:
>
> r...@mq005:~/activemq_rpms# cat /tmp/corosync.log
> Apr 13 11:20:31 corosync [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> Apr 13 11:20:31 corosync [MAIN ] Corosync built-in features: nss rdma
> Apr 13 11:20:31 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.
> Apr 13 11:20:31 corosync [pcmk ] info: process_ais_conf: Reading configure
> Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
> Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional logging options...
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
> Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional service options...
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> Apr 13 11:20:31 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
> Apr 13 11:20:31 corosync [pcmk ] Logging: Initialized pcmk_startup
> Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
> Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Service: 9
> Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local
> Apr 13 11:20:32 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 54279084
> Apr 13 11:20:32 corosync [pcmk ] info: update_member: Creating entry for node 54279084 born on 0
> Apr 13 11:20:32 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
> Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
> Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member
> Apr 13 11:20:32 corosync [pcmk ] info: spawn_child: Forked child 11873 for process stonithd
> Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11874 for process cib
> Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11875 for process lrmd
> Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11876 for process attrd
> Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11877 for process pengine
> Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11878 for process crmd
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.8
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync configuration service
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync profile loading service
> Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Apr 13 11:20:33 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.
> Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0
> Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: NEW: mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:33 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Apr 13 11:20:33 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874
> Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to cib
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to crmd
> Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: Creating entry for node 71056300 born on 648
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node 71056300/unknown is now: member
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 71056300
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 71056300
> Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648
> Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
> Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> Apr 13 11:20:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
>
> Can anyone help me out with this? Am I doing something wrong, or have I
> found a bug?
>
> Cheers,
> Tom
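
To expand on my suggestion at the top: ring 0 and ring 1 in the
corosync.conf above use mcastport 4010 and 4011. As I understand it,
corosync uses two UDP ports per interface (mcastport for multicast
receives and mcastport - 1 for sends), so with back-to-back port numbers
ring 1's send port lands on ring 0's receive port. Leaving a gap of at
least 2 avoids that. A minimal sketch of what I mean, changing only the
mcastport of ring 1 (please check the corosync.conf man page for your
version before relying on this):

    interface {
            ringnumber: 0
            bindnetaddr: 172.59.60.0
            mcastaddr: 226.94.1.1
            mcastport: 4010        # ring 0 uses UDP 4010 and 4009
    }
    interface {
            ringnumber: 1
            bindnetaddr: 172.23.40.0
            mcastaddr: 226.94.2.1
            mcastport: 4012        # was 4011; 4012/4011 no longer overlap ring 0's 4010/4009
    }

I can't promise this is why ring 0 keeps going FAULTY again after
corosync-cfgtool -r, but it is a cheap thing to rule out first.
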
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
