I'm in the testing phase of building a HA cluster using Heartbeat and DRBD.
Both nodes are running the same configuration files, e.g. drbd.conf,
ha.cf, haresources.
The two nodes are connected with dual heartbeat connections:
server_one, eth1 192.168.1.136---> server_two, eth1 192.168.1.137
server_one, eth3 192.168.2.136---> server_two, eth3 192.168.2.137
Heartbeat is able to ping the gateway from both servers:
Jul 1 17:43:22 server_one heartbeat: [21192]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
Jul 1 17:43:22 server_one heartbeat: [21192]: info: Status
update for node www.xxx.yyy.129: status ping
Jul 1 17:33:18 server_two heartbeat: [19332]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
Jul 1 17:33:18 server_two heartbeat: [19332]: info: Status
update for node www.xxx.yyy.129: status ping
Both nodes are running the same version of the software:
[r...@server_one ~]# rpm -qa | grep -i heart
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos
[r...@server_one ~]# rpm -qa | grep -i drbd
drbd82-8.2.6-1.el5.centos
kmod-drbd82-8.2.6-2
[r...@server_one ~]# service heartbeat status
service heartbeat status
heartbeat OK [pid 21192 et al] is running on server_one [server_one]...
[r...@server_two ~]# service heartbeat status
heartbeat OK [pid 19332 et al] is running on server_two [server_two]...
When testing failover, I issue the command "service heartbeat standby"
but, failover does not occur and the message "Standby request
cancelled" appears in the logs of both servers:
Tell server_one to standby:
[r...@server_one ~]# service heartbeat standby
service heartbeat standby
auto_failback: off
Attempting to enter standby mode
2009/07/02_10:54:38 Going standby [all].
Jul 2 10:54:38 server_one heartbeat: [21192]: info: server_one wants
to go standby [all]
Jul 2 10:54:48 server_one heartbeat: [21192]: WARN: No reply to
standby request. Standby request cancelled.
Tell server_two to standby:
[r...@server_two ~]# service heartbeat standby
auto_failback: off
Attempting to enter standby mode
2009/07/02_10:55:16 Going standby [all].
[r...@server_two ~]# [ OK ]
Jul 2 10:55:16 server_two heartbeat: [19332]: info: server_two wants
to go standby [all]
Jul 2 10:55:28 server_two heartbeat: [19332]: WARN: No reply to
standby request. Standby request cancelled.
Here is the log from server_one during a restart of drbd and heartbeat:
[r...@server_two ~]# service drbd restart ; sleep 10 ; service
heartbeat restart
Jul 1 17:31:23 server_one kernel: drbd0: State change failed: Device
is held open by someone
Jul 1 17:31:23 server_one kernel: drbd0: state = { cs:WFConnection
st:Primary/Unknown ds:UpToDate/DUnknown r--- }
Jul 1 17:31:23 server_one kernel: drbd0: wanted = { cs:WFConnection
st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
Jul 1 17:31:33 server_one heartbeat: [14355]: info: Heartbeat
shutdown in progress. (14355)
Jul 1 17:31:33 server_one heartbeat: [19482]: info: Giving up all HA
resources.
Jul 1 17:31:33 server_one ResourceManager[19495]: info: Releasing
resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail
Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/init.d/sendmail stop
Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 stop
Jul 1 17:31:33 server_one Filesystem[19575]: INFO: Running stop for
/dev/drbd0 on /home
Jul 1 17:31:33 server_one Filesystem[19575]: INFO: Trying to unmount
/home
Jul 1 17:31:33 server_one Filesystem[19575]: INFO: unmounted /home
successfully
Jul 1 17:31:33 server_one Filesystem[19564]: INFO: Success
Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 stop
Jul 1 17:31:33 server_one kernel: drbd0: role( Primary -> Secondary )
Jul 1 17:31:33 server_one kernel: drbd0: Writing meta data super block
now.
Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 stop
Jul 1 17:31:34 server_one IPaddr[19727]: INFO: ifconfig eth0:0 down
Jul 1 17:31:34 server_one avahi-daemon[7115]: Withdrawing address
record for www.xxx.yyy.150 on eth0.
Jul 1 17:31:34 server_one IPaddr[19698]: INFO: Success
Jul 1 17:31:34 server_one heartbeat: [19482]: info: All HA resources
relinquished.
Jul 1 17:31:35 server_one heartbeat: [14355]: info: killing
/usr/lib/heartbeat/ipfail process group 14366 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBFIFO
process 14357 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14358 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14359 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14360 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14361 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14362 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14363 with signal 15
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14359 exited. 7 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14363 exited. 6 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14361 exited. 5 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14358 exited. 4 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14362 exited. 3 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14357 exited. 2 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process
14360 exited. 1 remaining
Jul 1 17:31:36 server_one heartbeat: [14355]: info: server_one
Heartbeat shutdown complete.
Jul 1 17:32:08 server_one kernel: drbd0: conn( WFConnection ->
Disconnecting )
Jul 1 17:32:08 server_one kernel: drbd0: Discarding network
configuration.
Jul 1 17:32:08 server_one kernel: drbd0: tl_clear()
Jul 1 17:32:08 server_one kernel: drbd0: Connection closed
Jul 1 17:32:08 server_one kernel: drbd0: conn( Disconnecting ->
StandAlone )
Jul 1 17:32:08 server_one kernel: drbd0: receiver terminated
Jul 1 17:32:08 server_one kernel: drbd0: Terminating receiver thread
Jul 1 17:32:08 server_one kernel: drbd0: disk( UpToDate -> Diskless )
Jul 1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with
capacity == 0
Jul 1 17:32:08 server_one kernel: drbd0: worker terminated
Jul 1 17:32:08 server_one kernel: drbd0: Terminating worker thread
Jul 1 17:32:08 server_one kernel: drbd: module cleanup done.
Jul 1 17:32:08 server_one kernel: drbd: initialised. Version: 8.2.6
(api:88/proto:86-88)
Jul 1 17:32:08 server_one kernel: drbd: GIT-hash:
3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
build...@c5-i386-build, 2008-10-03 11:42:32
Jul 1 17:32:08 server_one kernel: drbd: registered as block device
major 147
Jul 1 17:32:08 server_one kernel: drbd: minor_table @ 0xf7467240
Jul 1 17:32:08 server_one kernel: drbd0: disk( Diskless -> Attaching )
Jul 1 17:32:08 server_one kernel: drbd0: Starting worker thread
(from cqueue/1 [145])
Jul 1 17:32:08 server_one kernel: klogd 1.4.1, ---------- state
change ----------
Jul 1 17:32:08 server_one kernel: drbd0: Found 6 transactions (34
active extents) in activity log.
Jul 1 17:32:08 server_one kernel: drbd0: max_segment_size ( = BIO
size ) = 32768
Jul 1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with
capacity == 832761342
Jul 1 17:32:08 server_one kernel: drbd0: resync bitmap:
bits=104095168 words=3252974
Jul 1 17:32:08 server_one kernel: drbd0: size = 397 GB (416380671 KB)
Jul 1 17:32:08 server_one kernel: drbd0: reading of bitmap took 21
jiffies
Jul 1 17:32:08 server_one kernel: drbd0: recounting of set bits took
additional 14 jiffies
Jul 1 17:32:08 server_one kernel: drbd0: 397 GB (104095168 bits)
marked out-of-sync by on disk bit-map.
Jul 1 17:32:08 server_one kernel: drbd0: disk( Attaching -> UpToDate )
Jul 1 17:32:08 server_one kernel: drbd0: Writing meta data super block
now.
Jul 1 17:32:08 server_one kernel: drbd0: conn( StandAlone ->
Unconnected )
Jul 1 17:32:08 server_one kernel: drbd0: Starting receiver thread
(from drbd0_worker [19793])
Jul 1 17:32:08 server_one kernel: drbd0: receiver (re)started
Jul 1 17:32:08 server_one kernel: drbd0: conn( Unconnected ->
WFConnection )
Jul 1 17:36:20 server_one kernel: device eth1 entered promiscuous mode
Jul 1 17:37:01 server_one heartbeat: [20041]: info: Version 2 support:
false
Jul 1 17:37:01 server_one heartbeat: [20041]: WARN: Logging daemon
is disabled --enabling logging daemon is recommended
Jul 1 17:37:01 server_one heartbeat: [20041]: info:
**************************
Jul 1 17:37:01 server_one heartbeat: [20041]: info: Configuration
validated. Starting heartbeat 2.1.3
Jul 1 17:37:01 server_one heartbeat: [20042]: info: heartbeat: version
2.1.3
Jul 1 17:37:01 server_one heartbeat: [20042]: info: Heartbeat
generation: 1245962690
Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth1
Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth3
Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth3 - Status: 1
Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: ping
heartbeat started.
Jul 1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_TriggerHandler: Added signal manual handler
Jul 1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_TriggerHandler: Added signal manual handler
Jul 1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Jul 1 17:37:01 server_one heartbeat: [20042]: info: Local status now
set to: 'up'
Jul 1 17:37:02 server_one heartbeat: [20042]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
Jul 1 17:37:02 server_one heartbeat: [20042]: info: Status update
for node www.xxx.yyy.129: status ping
Jul 1 17:37:15 server_one kernel: device eth1 left promiscuous mode
Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: node server_two:
is dead
Jul 1 17:37:32 server_one heartbeat: [20042]: info: Comm_now_up():
updating status to active
Jul 1 17:37:32 server_one heartbeat: [20042]: info: Local status now
set to: 'active'
Jul 1 17:37:32 server_one heartbeat: [20042]: info: Starting child
client "/usr/lib/heartbeat/ipfail" (498,496)
Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: No STONITH
device configured.
Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: Shared disks are
not protected.
Jul 1 17:37:32 server_one heartbeat: [20042]: info: Resources being
acquired from server_two.
Jul 1 17:37:32 server_one heartbeat: [20055]: info: Starting
"/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 20055)
Jul 1 17:37:32 server_one harc[20056]: info: Running
/etc/ha.d/rc.d/status status
Jul 1 17:37:32 server_one mach_down[20086]: info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources
acquired
Jul 1 17:37:32 server_one mach_down[20086]: info: mach_down takeover
complete for node server_two.
Jul 1 17:37:32 server_one heartbeat: [20042]: info: Initial resource
acquisition complete (T_RESOURCES(us))
Jul 1 17:37:32 server_one heartbeat: [20042]: info: mach_down
takeover complete.
Jul 1 17:37:32 server_one IPaddr[20129]: INFO: Resource is stopped
Jul 1 17:37:32 server_one heartbeat: [20057]: info: Local Resource
acquisition completed.
Jul 1 17:37:32 server_one harc[20193]: info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
Jul 1 17:37:32 server_one ip-request-resp[20193]: received
ip-request-resp IPaddr::www.xxx.yyy.150/26/eth0 OK yes
Jul 1 17:37:32 server_one ResourceManager[20214]: info: Acquiring
resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail
Jul 1 17:37:32 server_one IPaddr[20241]: INFO: Resource is stopped
Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 start
Jul 1 17:37:32 server_one IPaddr[20339]: INFO: Using calculated
netmask for www.xxx.yyy.150: 255.255.255.192
Jul 1 17:37:32 server_one IPaddr[20339]: INFO: eval ifconfig eth0:0
www.xxx.yyy.150 netmask 255.255.255.192 broadcast www.xxx.yyy.191
Jul 1 17:37:32 server_one avahi-daemon[7115]: Registering new
address record for www.xxx.yyy.150 on eth0.
Jul 1 17:37:32 server_one avahi-daemon[7115]: Withdrawing address
record for www.xxx.yyy.150 on eth0.
Jul 1 17:37:32 server_one avahi-daemon[7115]: Registering new
address record for www.xxx.yyy.150 on eth0.
Jul 1 17:37:32 server_one IPaddr[20310]: INFO: Success
Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 start
Jul 1 17:37:32 server_one kernel: drbd0: role( Secondary -> Primary )
Jul 1 17:37:32 server_one kernel: drbd0: Writing meta data super block
now.
Jul 1 17:37:32 server_one Filesystem[20487]: INFO: Resource is stopped
Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 start
Jul 1 17:37:32 server_one Filesystem[20568]: INFO: Running start for
/dev/drbd0 on /home
Jul 1 17:37:32 server_one kernel: kjournald starting. Commit
interval 5 seconds
Jul 1 17:37:32 server_one kernel: EXT3 FS on drbd0, internal journal
Jul 1 17:37:32 server_one kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Jul 1 17:37:32 server_one Filesystem[20557]: INFO: Success
Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/init.d/sendmail start
Jul 1 17:37:38 server_one heartbeat: [20042]: WARN: standby message
[me] from server_one ignored. local resources in flux.
Jul 1 17:37:42 server_one heartbeat: [20042]: info: Local Resource
acquisition completed. (none)
Jul 1 17:37:42 server_one heartbeat: [20042]: info: local resource
transition completed.
There is another cluster that was built about a year ago running the
same versions of heartbeat and drbd. I've configured this system in
the same way except for IP address changes, so it should work.
What am I overlooking here? What else can I try to troubleshoot this
and get the failover resolved?
Thanks,
=======Keith
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems