Just to clarify: when I ifdown eth1, corosync does detect the failure and it does mark the ring as faulty. Are you saying that after I ifup the interface, corosync can't work out that it is back up and that communications can resume when I run corosync-cfgtool -r? Would I therefore get a different result if I introduced the failure by physically unplugging the cat5 from the server and then physically reconnecting it? What about if I shut down the port on the switch it is connected to?
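
In the meantime, here is how I read the iptables approach you suggest, as a rough, untested sketch (assuming ring 0 runs over eth1 with mcastport 4010 as in the corosync.conf quoted further down, and blocking one extra port below it in case corosync also uses mcastport - 1):

  # block ring 0 traffic on eth1 to simulate the link failure
  iptables -A INPUT  -i eth1 -p udp --dport 4009:4010 -j DROP
  iptables -A OUTPUT -o eth1 -p udp --dport 4009:4010 -j DROP

  # once corosync-cfgtool -s shows ring 0 as FAULTY, remove the rules again
  iptables -D INPUT  -i eth1 -p udp --dport 4009:4010 -j DROP
  iptables -D OUTPUT -o eth1 -p udp --dport 4009:4010 -j DROP

  # and re-enable the ring
  corosync-cfgtool -r

Is that roughly the sort of blocking you had in mind?
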
On Tue, Apr 13, 2010 at 6:33 PM, Steven Dake <[email protected]> wrote:
> On Tue, 2010-04-13 at 17:04 +0100, Tom Pride wrote:
> > Hi Steve,
> >
> > Thanks for the suggestion, but that didn't work. I'm not sure whether
> > you read my entire post, but the two redundant rings that I have
> > configured both work without a problem until I introduce a fault by
> > shutting down eth1 on one of the nodes. This causes the cluster to
> > mark ringid 0 as FAULTY. When I then reactivate eth1 and both nodes
> > can once again ping each other over the network, I run
> > corosync-cfgtool -r, which should re-enable the FAULTY redundant ring
> > within corosync, but it doesn't work. Corosync refuses to re-enable
> > the ring even though there is no longer any network fault.
> >
> By deactivating eth1, i assume you mean you ifdown eth1. Unfortunately
> taking a network interface out of service while using redundant ring
> doesn't work properly. To verify that a failure on that interface is
> detected, i recommend using iptables to block the ports related to
> corosync.
>
> a bit more detail:
>
> http://www.corosync.org/doku.php?id=faq:ifdown
>
> > I might be mistaken, but isn't the trick of separating the port values
> > by 2 instead of 1 only for when you are using broadcast instead of the
> > recommended multicast? I'm using multicast.
> >
> Thought it may make a difference on the local interface port used for
> udp messages (the token), but wasn't sure.
>
> Regards
> -steve
>
> > Any more suggestions?
> >
> > Cheers,
> > Tom
> >
> > On Tue, Apr 13, 2010 at 4:37 PM, Steven Dake <[email protected]> wrote:
> > try separating the port values by 2 instead of 1.
> >
> > Regards
> > -steve
> >
> > On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
> > > Hi There,
> > >
> > > As per the recommendations, the 2 node clusters I have built use 2
> > > redundant rings for added resilience. I have currently been carrying
> > > out some testing on the clusters to ensure that a failure in one of
> > > the redundant rings can be recovered from. I am aware of the fact
> > > that corosync does not currently have a feature which monitors
> > > failed rings to bring them back up automatically when communications
> > > are repaired. All I have been doing is testing to see that the
> > > corosync-cfgtool -r command will do as it says on the tin and "Reset
> > > redundant ring state cluster wide after a fault, to re-enable
> > > redundant ring operation."
> > >
> > > In my 2 node cluster I have been issuing the ifdown command on eth1
> > > on node1. This results in corosync-cfgtool -s reporting the
> > > following:
> > >
> > > r...@mq006:~# corosync-cfgtool -s
> > > Printing ring status.
> > > Local node ID 71056300
> > > RING ID 0
> > >         id      = 172.59.60.4
> > >         status  = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > > RING ID 1
> > >         id      = 172.23.42.37
> > >         status  = ring 1 active with no faults
> > >
> > > I then issue ifup eth1 on node1 and ensure that I can now ping
> > > node2. The link is definitely up, so I then issue the command
> > > corosync-cfgtool -r. I then run corosync-cfgtool -s again and it
> > > reports:
> > >
> > > r...@mq006:~# corosync-cfgtool -s
> > > Printing ring status.
> > > Local node ID 71056300
> > > RING ID 0
> > >         id      = 172.59.60.4
> > >         status  = ring 0 active with no faults
> > > RING ID 1
> > >         id      = 172.23.42.37
> > >         status  = ring 1 active with no faults
> > >
> > > So things are looking good at this point, but if I wait 10 more
> > > seconds and run corosync-cfgtool -s again, it reports that ring_id 0
> > > is FAULTY again:
> > >
> > > r...@mq006:~# corosync-cfgtool -s
> > > Printing ring status.
> > > Local node ID 71056300
> > > RING ID 0
> > >         id      = 172.59.60.4
> > >         status  = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > > RING ID 1
> > >         id      = 172.23.42.37
> > >         status  = ring 1 active with no faults
> > >
> > > It does not matter how many times I run corosync-cfgtool -r, ring_id
> > > 0 will report it as being FAULTY 10 seconds after issuing the reset.
> > > I have tried running /etc/init.d/network restart on node1 in the
> > > hope that a full network stop and start makes a difference, but it
> > > doesn't. The only thing that will fix this situation is if I
> > > completely stop and restart the corosync cluster stack on both nodes
> > > (/etc/init.d/corosync stop and /etc/init.d/corosync start). Once
> > > I've done that both rings stay up and are stable. This is obviously
> > > not what we want.
> > >
> > > I am running the latest RHEL rpms from here:
> > > http://www.clusterlabs.org/rpm/epel-5/x86_64/
> > >
> > > corosync-1.2.1-1.el5
> > > corosynclib-1.2.1-1.el5
> > > pacemaker-1.0.8-4.el5
> > > pacemaker-libs-1.0.8-4.el5
> > >
> > > My corosync.conf looks like this:
> > >
> > > compatibility: whitetank
> > >
> > > totem {
> > >         version: 2
> > >         secauth: off
> > >         threads: 0
> > >         consensus: 1201
> > >         rrp_mode: passive
> > >         interface {
> > >                 ringnumber: 0
> > >                 bindnetaddr: 172.59.60.0
> > >                 mcastaddr: 226.94.1.1
> > >                 mcastport: 4010
> > >         }
> > >         interface {
> > >                 ringnumber: 1
> > >                 bindnetaddr: 172.23.40.0
> > >                 mcastaddr: 226.94.2.1
> > >                 mcastport: 4011
> > >         }
> > > }
> > >
> > > logging {
> > >         fileline: off
> > >         to_stderr: yes
> > >         to_logfile: yes
> > >         to_syslog: yes
> > >         logfile: /tmp/corosync.log
> > >         debug: off
> > >         timestamp: on
> > >         logger_subsys {
> > >                 subsys: AMF
> > >                 debug: off
> > >         }
> > > }
> > >
> > > amf {
> > >         mode: disabled
> > > }
> > >
> > > service {
> > >         # Load the Pacemaker Cluster Resource Manager
> > >         name: pacemaker
> > >         ver: 0
> > > }
> > >
> > > aisexec {
> > >         user: root
> > >         group: root
> > > }
> > >
> > > This is what gets written into /tmp/corosync.log when I carry out
> > > the link failure test and then try and reset the ring status:
> > >
> > > r...@mq005:~/activemq_rpms# cat /tmp/corosync.log
> > > Apr 13 11:20:31 corosync [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> > > Apr 13 11:20:31 corosync [MAIN ] Corosync built-in features: nss rdma
> > > Apr 13 11:20:31 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > > Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.
> > > Apr 13 11:20:31 corosync [pcmk ] info: process_ais_conf: Reading configure
> > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
> > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional logging options...
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
> > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional service options...
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> > > Apr 13 11:20:31 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
> > > Apr 13 11:20:31 corosync [pcmk ] Logging: Initialized pcmk_startup
> > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
> > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Service: 9
> > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local
> > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 54279084
> > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Creating entry for node 54279084 born on 0
> > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
> > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member
> > > Apr 13 11:20:32 corosync [pcmk ] info: spawn_child: Forked child 11873 for process stonithd
> > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11874 for process cib
> > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11875 for process lrmd
> > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11876 for process attrd
> > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11877 for process pengine
> > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11878 for process crmd
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.8
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync configuration service
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync profile loading service
> > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> > > Apr 13 11:20:33 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> > > Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.
> > > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0
> > > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: NEW: mq005.back.int.cwwtf.local 54279084
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > > Apr 13 11:20:33 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > > Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > > Apr 13 11:20:33 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874
> > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to cib
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to crmd
> > > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084
> > > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Creating entry for node 71056300 born on 648
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node 71056300/unknown is now: member
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 71056300
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 71056300
> > > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648
> > > Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > > Apr 13 11:20:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > > Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > >
> > > Can anyone help me out with this? Am I doing something wrong or have
> > > I found a bug?
> > >
> > > Cheers,
> > > Tom
> > >
> > > _______________________________________________
> > > Openais mailing list
> > > [email protected]
> > > https://lists.linux-foundation.org/mailman/listinfo/openais
> >
>
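
P.S. If we do end up revisiting your earlier suggestion of separating the mcastport values by 2, my understanding (please correct me if this is wrong) is that corosync also uses the port just below each mcastport, so with 4010 and 4011 the two rings could end up sharing a port. A sketch of what I think the interface blocks would then look like, with only the ring 1 port changed from my current config:

  interface {
          ringnumber: 0
          bindnetaddr: 172.59.60.0
          mcastaddr: 226.94.1.1
          mcastport: 4010
  }
  interface {
          ringnumber: 1
          bindnetaddr: 172.23.40.0
          mcastaddr: 226.94.2.1
          mcastport: 4012    # was 4011; leaves 4011 free as the port below ring 1's mcastport
  }
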
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
