I have three servers on which I'm trying to set up IP failover with Heartbeat. I have three IPs, one for each machine, and I want a machine's IP to be reassigned to a different machine when that machine goes down. This is all working splendidly.
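For context, each address is a plain ocf:heartbeat:IPaddr2 primitive, roughly like this (trimmed down; the netmask, NIC, and 5s monitor match what shows up in the logs below):

    primitive address01 ocf:heartbeat:IPaddr2 \
        params ip="50.97.234.170" cidr_netmask="29" nic="eth1" \
        op monitor interval="5s"

address02 and address03 look the same apart from the IP.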
But in addition, I also want an IP to be reassigned to a different machine when either the internal OR the external network interface goes down. To do this, I have a ping resource on each machine that pings the other two machines' internal and external IPs (so four IPs total being pinged from each machine). This is where I'm having problems: when I take down a network interface manually with ifdown, the cluster sometimes fails to stop the IP resources. This is what crm_mon outputs:

============
Last updated: Sun Feb 19 19:29:53 2012
Stack: Heartbeat
Current DC: anlutest2 (32769730-5e5e-40d6-baa0-9748131232da) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
3 Nodes configured, unknown expected votes
6 Resources configured.
============

Online: [ anlutest1 anlutest3 anlutest2 ]

address01 (ocf::heartbeat:IPaddr2): Started anlutest2 (unmanaged) FAILED
address02 (ocf::heartbeat:IPaddr2): Started anlutest3
address03 (ocf::heartbeat:IPaddr2): Started anlutest1 (unmanaged) FAILED
ping01 (ocf::pacemaker:ping): Started anlutest1
ping02 (ocf::pacemaker:ping): Started anlutest2
ping03 (ocf::pacemaker:ping): Started anlutest3

Failed actions:
    address01_stop_0 (node=anlutest2, call=454, rc=1, status=complete): unknown error
    address03_stop_0 (node=anlutest1, call=104, rc=1, status=complete): unknown error

The reason for this seems to be detailed in the syslog:

Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: rsc:address03:104: stop
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address01_monitor_5000 (call=100, status=1, cib-update=0, confirmed=true) Cancelled
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address03_monitor_5000 (call=102, status=1, cib-update=0, confirmed=true) Cancelled
Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32350]: INFO: IP status = ok, IP_CIP=
Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32351]: INFO: IP status = ok, IP_CIP=
Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32354]: INFO: ip -f inet addr delete 50.97.234.170/29 dev eth1
Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32355]: INFO: ip -f inet addr delete 50.97.234.172/29 dev eth1
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address01_stop_0 (call=103, rc=0, cib-update=135, confirmed=true) ok
Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: RA output: (address03:stop:stderr) RTNETLINK answers: Cannot assign requested address
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address03_stop_0 (call=104, rc=1, cib-update=136, confirmed=true) unknown error
Feb 19 19:25:07 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update relayed from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-address03 (INFINITY)
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent update 377: fail-count-address03=INFINITY
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update relayed from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-address03 (1329701107)
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent update 379: last-failure-address03=1329701107
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2
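For reference, the ping resources and the location constraints tying the addresses to connectivity are set up roughly like this (the host_list entries are placeholders here, and the dampen/multiplier values are illustrative):

    primitive ping01 ocf:pacemaker:ping \
        params host_list="<peer2-internal> <peer3-internal> <peer2-external> <peer3-external>" \
            dampen="5s" multiplier="100" \
        op monitor interval="10s"
    location address01-connectivity address01 \
        rule -inf: not_defined pingd or pingd lte 0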
I have no idea what the RTNETLINK error means, though. Googling turns up some issues with Ubuntu wireless drivers, but these interfaces are all wired. Does anyone have an idea what is going on? I suspect there may be some odd IP assignment happening, perhaps because the ping resources don't all report their scores at the same time.

When I manually clean up the failed resources, they get properly assigned to the nodes that aren't down. So if we can't resolve the underlying issue, is there a way to automatically attempt cleanup of failed resources a limited number of times? My configuration is here, in case there's anything wrong with it.
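From reading the docs, I'm guessing the failure-timeout and migration-threshold meta attributes might be the supported way to get something like that (the values here are just a guess):

    # let the fail-count expire after 2 minutes so the cluster
    # re-evaluates the resource instead of leaving it failed forever
    crm configure rsc_defaults failure-timeout="120s" migration-threshold="3"

but I'm not sure that covers a failed stop like the one above. For comparison, what I currently run by hand is:

    crm resource cleanup address03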
Anlu