Greetings,

My apologies for the lengthy first message to the list, but I'm at my wits'
end and would rather supply too much information than too little.
A link to ha-debug is included at the end of this message.

I've got a fresh pair of Ubuntu (7.10) boxes that I'm trying to get heartbeat
up and running on.

Both machines are identical. Communication has been verified on eth0 and
eth1, and unicast traffic appears functional on eth1.

Some background info:
node1: ldirector01.EQX eth0: 192.168.38.25/24 eth1: 192.168.43.25/24
node2: ldirector02.EQX eth0: 192.168.38.26/24 eth1: 192.168.43.26/24
VIP: 192.168.38.40/24
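
As a quick cross-check of the addressing above, here is a minimal Python
sketch (an illustration only, not part of heartbeat) that confirms the VIP
falls inside the eth0 subnet of both nodes. The variable names are made up
for this example; the addresses are copied from the list above.

```python
import ipaddress

# eth0 addresses copied from the background info above.
eth0 = {
    "ldirector01.EQX": ipaddress.ip_interface("192.168.38.25/24"),
    "ldirector02.EQX": ipaddress.ip_interface("192.168.38.26/24"),
}
vip = ipaddress.ip_interface("192.168.38.40/24")

# The VIP can only be aliased onto eth0 sensibly if it sits in eth0's subnet.
for node, iface in eth0.items():
    same_subnet = vip.ip in iface.network
    print(f"{node}: VIP in eth0 subnet {iface.network}: {same_subnet}")

vip_on_all = all(vip.ip in iface.network for iface in eth0.values())
```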

DNS entries return the IP address bound to eth0 for these hostnames.  I've
attached configurations to the end of the message, along with logs from the
primary node.

The problem is that when I start heartbeat on either node, the IP address
defined in haresources isn't being bound to the system.  I'm assuming it's
going to come up as eth0:0 (with subsequent definitions in haresources
incrementing the alias by 1), but it never appears.  I can manually bring up
the IP address:

[EMAIL PROTECTED]:/var/log# ifconfig eth0:0 up 192.168.38.40 netmask 255.255.255.0
SIOCSIFFLAGS: Cannot assign requested address

(The SIOCSIFFLAGS error appears to be a bug in Ubuntu's ifup/ifdown script)

However, when I do this (with heartbeat started on both nodes) and attempt
to fail over to the secondary node (either with /etc/init.d/heartbeat stop
or by simulating a power failure), the IP address does not get bound to the
second node.

To make things more confusing, when I start heartbeat on the secondary node
after manually bringing the VIP up on the primary node, heartbeat takes the
VIP offline (ResourceManager appears to hate me; see ha-log at
2007/12/13_10:04:57).

I'm looking for suggestions on where to go from here, and why
ResourceManager apparently only wants to remove IPs and not add them when it
starts.


Ha.cf:
Node1:
[EMAIL PROTECTED]:/etc/ha.d# cat ha.cf | grep -v \#
debugfile /var/log/ha-debug
logfile    /var/log/ha-log
logfacility    daemon
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport    694
ucast eth1 192.168.43.26
auto_failback on
node    ldirector01.EQX
node    ldirector02.EQX
ping_group router_group 192.168.38.1
respawn hacluster /usr/lib/heartbeat/ipfail
debug 1

Node2:
[EMAIL PROTECTED]:/etc/ha.d# cat ha.cf | grep -v \#
debugfile /var/log/ha-debug
logfile    /var/log/ha-log
logfacility    daemon
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport    694
ucast eth1 192.168.43.25
auto_failback on
node    ldirector01.EQX
node    ldirector02.EQX
ping_group router_group 192.168.38.1
respawn hacluster /usr/lib/heartbeat/ipfail
debug 1

Haresources has only a single definition, super simple while testing:
node1: ldirector02.EQX IPaddr::192.168.38.40/24/eth0
node2: ldirector01.EQX IPaddr::192.168.38.40/24/eth0
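
To make the two haresources lines easier to compare, here is a minimal
Python sketch (not part of heartbeat; the function name and dictionary keys
are invented for illustration) that splits each line into its node name and
IPaddr arguments and reports which fields differ between the two copies. It
assumes the simple one-resource format shown above.

```python
def parse_haresources_line(line):
    """Split 'node IPaddr::ip/prefix/iface' into its parts."""
    node, resource = line.split()
    script, _, args = resource.partition("::")
    ip, prefix, iface = args.split("/")
    return {
        "node": node,
        "script": script,
        "ip": ip,
        "prefix": int(prefix),
        "iface": iface,
    }

# Lines as they appear on each node (without the "node1:"/"node2:" labels).
node1 = parse_haresources_line("ldirector02.EQX IPaddr::192.168.38.40/24/eth0")
node2 = parse_haresources_line("ldirector01.EQX IPaddr::192.168.38.40/24/eth0")

# Fields whose values are not the same in both copies of the file.
differing = sorted(key for key in node1 if node1[key] != node2[key])
print("fields that differ between the two files:", differing)
```

With the lines above, the only field that differs is the node name.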

Authkeys is mode 600 on both nodes; both use auth 3, defined as md5 with the
same key string.

Logs:
node1's /var/log/ha-log:
heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=1: key = 0x6d9a98,
auth=0x2ae8dd26a470, authname=crc
heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=2: key = 0x6da468,
auth=0x2ae8dd46def0, authname=sha1
heartbeat[15380]: 2007/12/13_09:54:44 info: AUTH: i=3: key = 0x6dae68,
auth=0x2ae8dd66ee10, authname=md5
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Core dumps could be lost if
multiple dumps occur.
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Consider setting non-default
value in /proc/sys/kernel/core_pattern (or equivalent) for maximum
supportability
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
heartbeat[15380]: 2007/12/13_09:54:44 info: Version 2 support: false
heartbeat[15380]: 2007/12/13_09:54:44 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[15380]: 2007/12/13_09:54:44 info: **************************
heartbeat[15380]: 2007/12/13_09:54:44 info: Configuration validated.
Starting heartbeat 2.1.2
heartbeat[15381]: 2007/12/13_09:54:44 info: heartbeat: version 2.1.2
heartbeat[15381]: 2007/12/13_09:54:44 info: Heartbeat generation: 1197490909
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[15381]: 2007/12/13_09:54:44 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: bound send socket
to device: eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: bound receive
socket to device: eth1
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ucast: started on port 694
interface eth1 to 192.168.43.26
heartbeat[15381]: 2007/12/13_09:54:44 info: glib: ping group heartbeat
started.
heartbeat[15381]: 2007/12/13_09:54:44 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[15381]: 2007/12/13_09:54:44 info: Local status now set to: 'up'
heartbeat[15381]: 2007/12/13_09:54:45 info: Link router_group:router_group
up.
heartbeat[15381]: 2007/12/13_09:54:45 info: Status update for node
router_group: status ping

<start heartbeat on secondary node>

heartbeat[15381]: 2007/12/13_10:04:44 info: Daily informational memory
statistics
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 101/680 ms age 0
[pid15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 3460/18414
383248/179790 [pid15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 397472 total
malloc bytes. pid [15381/MST_CONTROL]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/2 ms age 479440
[pid15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 371/458
45524/21281 [pid15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 48096 total
malloc bytes. pid [15385/HBFIFO]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/0 ms age
17234757580 [pid15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 372/794
45808/21481 [pid15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 54488 total
malloc bytes. pid [15386/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/0 ms age
17234757580 [pid15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 372/433
37680/17448 [pid15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 37772 total
malloc bytes. pid [15387/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/649 ms age 1960
[pid15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 374/17080
45992/21609 [pid15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 59820 total
malloc bytes. pid [15388/HBWRITE]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: MSG stats: 0/306 ms age 1960
[pid15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: cl_malloc stats: 375/6556
46084/21673 [pid15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: RealMalloc stats: 48220 total
malloc bytes. pid [15389/HBREAD]
heartbeat[15381]: 2007/12/13_10:04:44 info: Current arena value: 0
heartbeat[15381]: 2007/12/13_10:04:44 info: These are nothing to worry
about.
heartbeat[15381]: 2007/12/13_10:04:55 info: Link ldirector02.eqx:eth1 up.
heartbeat[15381]: 2007/12/13_10:04:55 info: Link ldirector02.eqx:eth1 up.
heartbeat[15381]: 2007/12/13_10:04:55 info: Status update for node
ldirector02.eqx: status init
heartbeat[15381]: 2007/12/13_10:04:55 info: Status update for node
ldirector02.eqx: status up
harc[15463]:    2007/12/13_10:04:55 info: Running /etc/ha.d/rc.d/status
status
heartbeat[15381]: 2007/12/13_10:04:55 info: Exiting status process 15463
returned rc 0.
harc[15472]:    2007/12/13_10:04:55 info: Running /etc/ha.d/rc.d/status
status
heartbeat[15381]: 2007/12/13_10:04:55 info: Exiting status process 15472
returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:56 info: Status update for node
ldirector02.eqx: status active
heartbeat[15381]: 2007/12/13_10:04:56 info: all clients are now paused
heartbeat[15381]: 2007/12/13_10:04:56 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
harc[15480]:    2007/12/13_10:04:56 info: Running /etc/ha.d/rc.d/status
status
heartbeat[15381]: 2007/12/13_10:04:56 info: Exiting status process 15480
returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:57 info: other_holds_resources: 0
heartbeat[15381]: 2007/12/13_10:04:57 info: remote resource transition
completed.
heartbeat[15381]: 2007/12/13_10:04:57 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:57 info: ldirector01.eqx wants to go
standby [foreign]
heartbeat[15381]: 2007/12/13_10:04:57 info: i_hold_resources: 3
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 1
heartbeat[15381]: 2007/12/13_10:04:57 info: other_holds_resources: 0
heartbeat[15381]: 2007/12/13_10:04:57 info: standby: ldirector02.eqx can
take our foreign resources
heartbeat[15381]: 2007/12/13_10:04:57 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 1
heartbeat[15488]: 2007/12/13_10:04:57 info: give up foreign HA resources
(standby).
heartbeat[15488]: 2007/12/13_10:04:57 info: go_standby: who: 1 resource set:
foreign
heartbeat[15488]: 2007/12/13_10:04:57 info: go_standby: (query/action):
(otherkeys/givegroup)
ResourceManager[15499]: 2007/12/13_10:04:57 info: Releasing resource group:
ldirector02.eqx IPaddr::192.168.38.40/24/eth0
ResourceManager[15499]: 2007/12/13_10:04:57 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.38.40/24/eth0 stop
IPaddr[15533]:  2007/12/13_10:04:57 info: /sbin/route -n del -host
192.168.38.40
IPaddr[15533]:  2007/12/13_10:04:57 info: /sbin/ifconfig eth0:0 down
IPaddr[15533]:  2007/12/13_10:04:57 info: IP Address 192.168.38.40 released
heartbeat[15488]: 2007/12/13_10:04:57 info: foreign HA resource release
completed (standby).
heartbeat[15488]: 2007/12/13_10:04:57 info: FIFO message [type
ask_resources] written rc=51
heartbeat[15381]: 2007/12/13_10:04:57 info: Local standby process completed
[foreign].
heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 3
heartbeat[15381]: 2007/12/13_10:04:57 info: Exiting go_standby process 15488
returned rc 0.
heartbeat[15381]: 2007/12/13_10:04:58 info: all clients are now resumed
heartbeat[15381]: 2007/12/13_10:04:58 WARN: 1 lost packet(s) for
[ldirector02.eqx] [12:14]
heartbeat[15381]: 2007/12/13_10:04:58 info: remote resource transition
completed.
heartbeat[15381]: 2007/12/13_10:04:58 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:58 info: other_holds_resources: 1
heartbeat[15381]: 2007/12/13_10:04:58 info: No pkts missing from
ldirector02.eqx!
heartbeat[15381]: 2007/12/13_10:04:58 info: Other node completed standby
takeover of foreign resources.
heartbeat[15381]: 2007/12/13_10:04:58 info: AnnounceTakeover(local 1,
foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat[15381]: 2007/12/13_10:04:58 info: New standby state: 0
heartbeat[15381]: 2007/12/13_10:04:58 info: other_holds_resources: 1
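
Since the log above is long, here is a minimal Python sketch (an
illustration only; the regular expression and function name are my own, not
a heartbeat tool) for pulling out just the ResourceManager and IPaddr
entries. It assumes each entry has been rejoined onto a single line, since
the mail client wrapped the lines above.

```python
import re
from datetime import datetime

# process[pid]: YYYY/MM/DD_HH:MM:SS level: message
LOG_RE = re.compile(
    r"^(?P<proc>\w+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+(?P<msg>.*)$"
)

def parse_log_line(line):
    """Return a dict for one ha-log entry, or None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d_%H:%M:%S")
    return record

# Three entries from the log above, unwrapped onto single lines.
sample = [
    "heartbeat[15381]: 2007/12/13_10:04:57 info: New standby state: 1",
    "ResourceManager[15499]: 2007/12/13_10:04:57 info: Running "
    "/etc/ha.d/resource.d/IPaddr 192.168.38.40/24/eth0 stop",
    "IPaddr[15533]:  2007/12/13_10:04:57 info: /sbin/ifconfig eth0:0 down",
]

records = [parse_log_line(line) for line in sample]
resource_events = [
    r for r in records if r and r["proc"] in ("ResourceManager", "IPaddr")
]
for event in resource_events:
    print(event["ts"], event["proc"], event["msg"])
```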

/var/log/ha-debug.log: http://jalons.net/ha-debug.log

-- 
Jeremy Alons
Systems Administrator
866 839 1100 ext 3286
773 435 3286 direct
773 435 3232 fax

thinkorswim,inc.
600 West Chicago Ave, Suite #100
Chicago, IL 60610

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems