Greetings,

My apologies for the lengthy first message to the list, but I'm at my wits' end and would rather supply too much information than too little. A link to ha-debug is included at the end of this message.

I've got a fresh pair of Ubuntu boxes (7.10) that I'm trying to get heartbeat up and running on. Both machines are identical, communication has been verified on eth0 and eth1, and unicast traffic appears functional on eth1.

Some background info:

node1: ldirector01.EQX
  eth0: 192.168.38.25/24
  eth1: 192.168.43.25/24
node2: ldirector02.EQX
  eth0: 192.168.38.26/24
  eth1: 192.168.43.26/24
VIP: 192.168.38.40/24

DNS entries return the IP address bound to eth0 for these hostnames. I've attached configurations to the end of the message, along with logs from the primary node.

The problem is that when I start heartbeat on either node, the IP address defined in haresources isn't bound to the system. I'm assuming it should come up as eth0:0 (with subsequent definitions in haresources incrementing the alias by 1), but it isn't playing nice. I can manually bring up the IP address:

[EMAIL PROTECTED]:/var/log# ifconfig eth0:0 up 192.168.38.40 netmask 255.255.255.0
SIOCSIFFLAGS: Cannot assign requested address

(The SIOCSIFFLAGS error appears to be a bug in Ubuntu's ifup/ifdown script.)

However, when I do this (with heartbeat started on both nodes) and attempt to fail over to the secondary node (either with /etc/init.d/heartbeat stop or by simulating a power failure), the IP address does not get bound on the second node. To make things more confusing, when I start heartbeat on the secondary node after manually bringing the VIP up on the primary node, heartbeat takes the VIP offline (ResourceManager appears to hate me, in ha-log, at 2007/12/13_10:04:57).

I'm looking for suggestions on where to go from here, and on why ResourceManager apparently only wants to remove IPs and not add them when it starts.

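For anyone trying to reproduce this, the "is the VIP bound or not" check described above can be scripted so that both nodes are tested the same way after each start, stop, or failover. This is only a rough sketch: the function name vip_in_listing is made up for illustration, and it assumes the iproute2 `ip` tool is installed alongside ifconfig.

```shell
#!/bin/sh
# Hypothetical helper (not part of heartbeat): reads the output of
# `ip -o -4 addr show` on stdin and succeeds if the given address
# appears as an IPv4 address on any listed interface.
vip_in_listing() {
    vip="$1"
    # Match "inet <vip>/" as a fixed string so that 192.168.38.4
    # does not match 192.168.38.40 and vice versa.
    grep -qF "inet ${vip}/"
}

# Example use on a node (assumes iproute2 is available):
#   if ip -o -4 addr show dev eth0 | vip_in_listing 192.168.38.40; then
#       echo "VIP is bound on this node"
#   else
#       echo "VIP is NOT bound on this node"
#   fi
```

Running the same check on both nodes right after a failover attempt shows quickly whether the address moved, vanished, or ended up on both machines.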
Ha.cf:

Node1:
[EMAIL PROTECTED]:/etc/ha.d# cat ha.cf | grep -v \#
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility daemon
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 694
ucast eth1 192.168.43.26
auto_failback on
node ldirector01.EQX
node ldirector02.EQX
ping_group router_group 192.168.38.1
respawn hacluster /usr/lib/heartbeat/ipfail
debug 1

Node2:
[EMAIL PROTECTED]:/etc/ha.d# cat ha.cf | grep -v \#
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility daemon
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 694
ucast eth1 192.168.43.25
auto_failback on
node ldirector01.EQX
node ldirector02.EQX
ping_group router_group 192.168.38.1
respawn hacluster /usr/lib/heartbeat/ipfail
debug 1

Haresources has only a single definition, super simple while testing:

node1:
ldirector02.EQX IPaddr::192.168.38.40/24/eth0

node2:
ldirector01.EQX IPaddr::192.168.38.40/24/eth0

Authkeys are mode 600 on both, both using auth 3, both defined as an md5 on the same string.

Logs:

node1's /var/log/ha-log:

heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=1: key = 0x6d9a98, auth=0x2ae8dd26a470, authname=crc
heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=2: key = 0x6da468, auth=0x2ae8dd46def0, authname=sha1
heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=3: key = 0x6dae68, auth=0x2ae8dd66ee10, authname=md5
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Core dumps could be lost if multiple dumps occur.
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[15380]: 2007/12/13_09:54:44 info: Version 2 support: false
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[15380]: 2007/12/13_09:54:44 info: **************************
heartbeat[15380]: 2007/12/13_09:54:44 info: Configuration validated. Starting heartbeat 2.1.2
heartbeat[15381]: 2007/12/13_09:54:44 info: heartbeat: version 2.1.2
heartbeat[15381]: 2007/12/13_09:54:44 info: Heartbeat generation: 1197490909
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[15381]: 2007/12/13_09:54:44 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: bound send socket to device: eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: bound receive socket to device: eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: started on port 694 interface eth1 to 192.168.43.26
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ping group heartbeat started.
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[15381]: 2007/12/13_09:54:44 info: Local status now set to: 'up'
heartbeat[15381]: 2007/12/13_09:54:45 info: Link router_group:router_group up.
heartbeat[15381]: 2007/12/13_09:54:45 info: Status update for node router_group: status ping

<start heartbeat on secondary node>

heartbeat[15381]: 2007/12/13_10:04:44 info: Daily informational memory statistics
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 101/680 ms age 0 [pid15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 3460/18414 383248/179790 [pid15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 397472 total malloc bytes. pid [15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/2 ms age 479440 [pid15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 371/458 45524/21281 [pid15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 48096 total malloc bytes. pid [15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/0 ms age 17234757580 [pid15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 372/794 45808/21481 [pid15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 54488 total malloc bytes. pid [15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/0 ms age 17234757580 [pid15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 372/433 37680/17448 [pid15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 37772 total malloc bytes. pid [15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/649 ms age 1960 [pid15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 374/17080 45992/21609 [pid15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 59820 total malloc bytes. pid [15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/306 ms age 1960 [pid15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 375/6556 46084/21673 [pid15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 48220 total malloc bytes. pid [15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: These are nothing to worry about.
heartbeat[15381]: 2007/12/13_10:04:55 info: Link ldirector02.eqx:eth1 up.
heartbeat[15381]: 2007/12/13_10:04:55 info: Link ldirector02.eqx:eth1 up.
heartbeat[15381]: 2007/12/13_10:04:55 info: Status update for node ldirector02.eqx: status init
heartbeat[15381]: 2007/12/13_10:04:55 info: Status update for node ldirector02.eqx: status up
harc[15463]: 2007/12/13_10:04:55 info: Running /etc/ha.d/rc.d/status status
heartbeat[15381]: 2007/12/13_10:04:55 info: Exiting status process 15463 returned rc 0.
harc[15472]: 2007/12/13_10:04:55 info: Running /etc/ha.d/rc.d/status status
heartbeat[15381]: 2007/12/13_10:04:55 info: Exiting status process 15472 returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:56 info: Status update for node ldirector02.eqx: status active
heartbeat[15381]: 2007/12/13_10:04:56 info: all clients are now paused
heartbeat[15381]: 2007/12/13_10:04:56 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
harc[15480]: 2007/12/13_10:04:56 info: Running /etc/ha.d/rc.d/status status
heartbeat[15381]: 2007/12/13_10:04:56 info: Exiting status process 15480 returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:57 info: other_holds_resources: 0
heartbeat[15381]: 2007/12/13_10:04:57 info: remote resource transition completed.
heartbeat[15381]: 2007/12/13_10:04:57 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:57 info: ldirector01.eqx wants to go standby [foreign]
heartbeat[15381]: 2007/12/13_10:04:57 info: i_hold_resources: 3
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 1
heartbeat[15381]: 2007/12/13_10:04:57 info: other_holds_resources: 0
heartbeat[15381]: 2007/12/13_10:04:57 info: standby: ldirector02.eqx can take our foreign resources
heartbeat[15381]: 2007/12/13_10:04:57 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 1
heartbeat[15488]: 2007/12/13_10:04:57 info: give up foreign HA resources (standby).
heartbeat[15488]: 2007/12/13_10:04:57 info: go_standby: who: 1 resource set: foreign
heartbeat[15488]: 2007/12/13_10:04:57 info: go_standby: (query/action): (otherkeys/givegroup)
ResourceManager[15499]: 2007/12/13_10:04:57 info: Releasing resource group: ldirector02.eqx IPaddr::192.168.38.40/24/eth0
ResourceManager[15499]: 2007/12/13_10:04:57 info: Running /etc/ha.d/resource.d/IPaddr 192.168.38.40/24/eth0 stop
IPaddr[15533]: 2007/12/13_10:04:57 info: /sbin/route -n del -host 192.168.38.40
IPaddr[15533]: 2007/12/13_10:04:57 info: /sbin/ifconfig eth0:0 down
IPaddr[15533]: 2007/12/13_10:04:57 info: IP Address 192.168.38.40 released
heartbeat[15488]: 2007/12/13_10:04:57 info: foreign HA resource release completed (standby).
heartbeat[15488]: 2007/12/13_10:04:57 info: FIFO message [type ask_resources] written rc=51
heartbeat[15381]: 2007/12/13_10:04:57 info: Local standby process completed [foreign].
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 3
heartbeat[15381]: 2007/12/13_10:04:57 info: Exiting go_standby process 15488 returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:58 info: all clients are now resumed
heartbeat[15381]: 2007/12/13_10:04:58 WARN: 1 lost packet(s) for [ldirector02.eqx] [12:14]
heartbeat[15381]: 2007/12/13_10:04:58 info: remote resource transition completed.
heartbeat[15381]: 2007/12/13_10:04:58 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:58 info: other_holds_resources: 1
heartbeat[15381]: 2007/12/13_10:04:58 info: No pkts missing from ldirector02.eqx!
heartbeat[15381]: 2007/12/13_10:04:58 info: Other node completed standby takeover of foreign resources.
heartbeat[15381]: 2007/12/13_10:04:58 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:58 info: New standby state: 0
heartbeat[15381]: 2007/12/13_10:04:58 info: other_holds_resources: 1

/var/log/ha-debug: http://jalons.net/ha-debug.log

--
Jeremy Alons
Systems Administrator
866 839 1100 ext 3286
773 435 3286 direct
773 435 3232 fax
thinkorswim, inc.
600 West Chicago Ave, Suite #100
Chicago, IL 60610
Member FINRA/SIPC/NFA
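
P.S. Since the log above is mostly heartbeat's own chatter, the handful of lines that actually touch the VIP are easy to miss. A small filter along these lines pulls out only the entries written by the resource scripts. It is a sketch that assumes the ha-log format shown above, where each entry starts with the program name followed by a PID in square brackets; the function name resource_events is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads ha-log text on stdin and prints only the
# entries logged by ResourceManager or the IPaddr resource script.
# Assumes each entry begins with "<program>[<pid>]:" as in the log above.
resource_events() {
    grep -E '^(ResourceManager|IPaddr)\[[0-9]+\]:'
}

# Example use (path taken from the logfile directive in ha.cf):
#   resource_events < /var/log/ha-log
```

Applied to the excerpt above, this leaves only the release sequence at 2007/12/13_10:04:57, which makes it easier to see that no matching start of the IPaddr resource was logged on this node.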
