On Mon, Feb 20, 2012 at 12:44 PM, Anlu Wang <a...@mixpanel.com> wrote:
> I have three servers that I'm trying to create IP failover on with
> heartbeat. I have three IPs, one for each machine, and I want an IP to be
> assigned to a different machine when it goes down. This is all working
> splendidly.
>
> But in addition, I also want an IP to be assigned to a different machine
> when either the internal OR external network interface goes down. To do
> this, I have a ping resource on each machine that pings the other 2 machines
> internal and external ips (so 4 IPs total being pinged on each machine).
> This is where I'm having problems.
>
> When I take down a network interface manually with ifdown, sometimes it
> fails to stop IP resources on the machines. This is what crm_mon outputs:
>
> ============
> Last updated: Sun Feb 19 19:29:53 2012
> Stack: Heartbeat
> Current DC: anlutest2 (32769730-5e5e-40d6-baa0-9748131232da) - partition
> with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 3 Nodes configured, unknown expected votes
> 6 Resources configured.
> ============
>
> Online: [ anlutest1 anlutest3 anlutest2 ]
>
> address01 (ocf::heartbeat:IPaddr2): Started anlutest2
> (unmanaged) FAILED
> address02 (ocf::heartbeat:IPaddr2): Started anlutest3
> address03 (ocf::heartbeat:IPaddr2): Started anlutest1
> (unmanaged) FAILED
> ping01 (ocf::pacemaker:ping): Started anlutest1
> ping02 (ocf::pacemaker:ping): Started anlutest2
> ping03 (ocf::pacemaker:ping): Started anlutest3
>
> Failed actions:
> address01_stop_0 (node=anlutest2, call=454, rc=1, status=complete):
> unknown error
> address03_stop_0 (node=anlutest1, call=104, rc=1, status=complete):
> unknown error
>
> The reason for this seems to be detailed in the syslog:
>
> Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: rsc:address03:104: stop
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address01_monitor_5000 (call=100, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address03_monitor_5000 (call=102, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32350]: INFO: IP status = ok,
> IP_CIP=
> Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32351]: INFO: IP status = ok,
> IP_CIP=
> Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32354]: INFO: ip -f inet addr
> delete 50.97.234.170/29 dev eth1
> Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32355]: INFO: ip -f inet addr
> delete 50.97.234.172/29 dev eth1
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address01_stop_0 (call=103, rc=0, cib-update=135, confirmed=true)
> ok
> Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: RA output:
> (address03:stop:stderr) RTNETLINK answers: Cannot assign requested address
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address03_stop_0 (call=104, rc=1, cib-update=136, confirmed=true)
> unknown error
> Feb 19 19:25:07 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update
> relayed from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-address03 (INFINITY)
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent
> update 377: fail-count-address03=INFINITY
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update
> relayed from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-address03 (1329701107)
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent
> update 379: last-failure-address03=1329701107
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
>
> But I have no idea what the RTNETLINK error is. Googling around seems to
> show some issues about Ubuntu wireless drivers, but these interfaces are all
> wired. Does anyone have any idea what is going on? I suspect there might be
> some sort of weird IP assigning going on, due to the pingd resource not
> reporting their scores all at the same time maybe?

Shouldn't be. The question is, why would we be /assigning/ an IP
during a /stop/ action.

>
> When I manually go and cleanup the failed nodes, they get properly assigned
> to the nodes that aren't down, so if we can't resolve the underlying issue,
> is there a way to automatically attempt to cleanup failed resources a
> limited number of times?

I don't think you want to start the IP somewhere else if it's still
active on the original node.

>
> My configuration is here, in case there's anything wrong with it.

Looks like you forgot to attach it.

>
> Anlu
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
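
On the "assigning during a stop" puzzle: as far as I can tell, "Cannot
assign requested address" is just the C library's generic text for errno
EADDRNOTAVAIL, so the wording alone does not mean anything was being
assigned. My understanding (an assumption, not checked against this
cluster) is that `ip addr delete` gets that errno back when the address is
no longer present on the device, which would fit an interface that had just
been taken down with ifdown. A minimal Python sketch that maps such a
message back to its errno name follows; `errno_names_for` is a hypothetical
helper, and the text returned by os.strerror depends on the platform's C
library.

```python
import errno
import os


def errno_names_for(message):
    """Return the symbolic errno names whose strerror() text equals message."""
    matches = []
    for number, name in sorted(errno.errorcode.items()):
        if os.strerror(number) == message:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Text taken from the "RTNETLINK answers: ..." line in the syslog excerpt.
    text = "Cannot assign requested address"
    # Prints the matching errno names, if any, for this platform's C library.
    print(errno_names_for(text))
```

If the lookup does point at EADDRNOTAVAIL, comparing `ip addr show dev eth1`
before and after the ifdown would be one way to check whether the address
was already gone by the time the stop action ran.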