Hi,
I have a weird heartbeat problem and am hoping that someone on this
email list can point to something
that I am doing wrong.
I have drbd+heartbeat+nfs running over 2 systems installed with CentOS
5.1 operating system.
My ha.cf is:
debugfile /var/log/ha-debug.log
logfile /var/log/ha.log
logfacility local0
keepalive 2
warntime 10
deadtime 20
initdead 75
bcast eth0
auto_failback off
node watchdog-client1
node watchdog-client2
My drbd.conf is
global {
usage-count yes;
}
common {
protocol C;
}
resource r0 {
startup {
wfc-timeout 0;
degr-wfc-timeout 120; # 2 minutes.
}
disk {
on-io-error detach;
}
net {
timeout 60;
connect-int 10;
ping-int 10;
max-buffers 4096;
max-epoch-size 4096;
}
syncer {
rate 500M;
al-extents 257;
}
on watchdog-client1 {
device /dev/drbd1;
disk /dev/sdb1;
address 10.0.38.40:7789;
meta-disk /dev/sda6[0];
}
on watchdog-client2 {
device /dev/drbd1;
disk /dev/sdb1;
address 10.0.38.43:7789;
meta-disk /dev/sda6[0];
}
}
This is my haresources file:
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0
Filesystem::/dev/drbd1:://mnt/data::ext3 kill nfs Delay::3::0 nfs
nfslock
watchdog-client1 is the initial Primary node and client2 is the
secondary node.
When I shutdown client1, client2 becomes the primary and mounts the file
system as expected.
Everything works like a charm!!!
Now, when I power back client1, it never becomes the secondary and the
file system on client2 for some reason gets un-mounted and never gets
mounted again.
The logs on the client1 machine, tell me that it's trying to mount the
file system twice (as shown in bold below) and it fails the second time,
since the file system is already mounted the first time successfully.
Since, it fails the second time, the heartbeat resource just stops all
its other processes.
Below is my client1 logs:
heartbeat[2542]: 2008/07/14_09:54:48 info: Received shutdown notice from
'watchdog-client2'.
heartbeat[2542]: 2008/07/14_09:54:48 info: Resources being acquired from
watchdog-client2.
heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 1
heartbeat[2847]: 2008/07/14_09:54:48 info: acquire all HA resources
(standby).
ResourceManager[2873]: 2008/07/14_09:54:48 info: Acquiring resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock
IPaddr[2915]: 2008/07/14_09:54:48 INFO: Resource is stopped
IPaddr[2929]: 2008/07/14_09:54:48 INFO: Resource is stopped
heartbeat[2848]: 2008/07/14_09:54:48 info: Local Resource acquisition
completed.
heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 2
heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 1
ResourceManager[2873]: 2008/07/14_09:54:48 info: Running
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start
ResourceManager[2873]: 2008/07/14_09:54:48 debug: Starting
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start
IPaddr[3072]: 2008/07/14_09:54:48 INFO: Using calculated netmask for
10.0.38.71: 255.255.255.0
IPaddr[3072]: 2008/07/14_09:54:48 DEBUG: Using calculated broadcast
for 10.0.38.71: 10.0.38.255
IPaddr[3072]: 2008/07/14_09:54:48 INFO: eval ifconfig eth0:0
10.0.38.71 netmask 255.255.255.0 broadcast 10.0.38.255
IPaddr[3072]: 2008/07/14_09:54:48 DEBUG: Sending Gratuitous Arp for
10.0.38.71 on eth0:0 [eth0]
IPaddr[3043]: 2008/07/14_09:54:48 INFO: Success
INFO: Success
ResourceManager[2873]: 2008/07/14_09:54:48 debug:
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start done. RC=0
ResourceManager[2873]: 2008/07/14_09:54:48 info: Running
/etc/ha.d/resource.d/drbddisk r0 start
ResourceManager[2873]: 2008/07/14_09:54:48 debug: Starting
/etc/ha.d/resource.d/drbddisk r0 start
ResourceManager[2873]: 2008/07/14_09:54:48 debug:
/etc/ha.d/resource.d/drbddisk r0 start done. RC=0
Delay[3231]: 2008/07/14_09:54:51 INFO: Delay is stopped
Delay[3220]: 2008/07/14_09:54:51 INFO: Resource is stopped
ResourceManager[2873]: 2008/07/14_09:54:51 info: Running
/etc/ha.d/resource.d/Delay 3 0 start
ResourceManager[2873]: 2008/07/14_09:54:51 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 start
Delay[3278]: 2008/07/14_09:54:54 INFO: Success
INFO: Success
ResourceManager[2873]: 2008/07/14_09:54:54 debug:
/etc/ha.d/resource.d/Delay 3 0 start done. RC=0
Filesystem[3325]: 2008/07/14_09:54:54 INFO: Resource is stopped
ResourceManager[2873]: 2008/07/14_09:54:55 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[2873]: 2008/07/14_09:54:55 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[3406]: 2008/07/14_09:54:55 INFO: Running start for
/dev/drbd1 on //mnt/data
Filesystem[3395]: 2008/07/14_09:54:55 INFO: Success
INFO: Success
ResourceManager[2873]: 2008/07/14_09:54:55 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done.
RC=0
ResourceManager[2873]: 2008/07/14_09:54:55 info: Running
/etc/init.d/killnfs start
ResourceManager[2873]: 2008/07/14_09:54:55 debug: Starting
/etc/init.d/killnfs start
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ...
or kill -l [sigspec]
ResourceManager[2873]: 2008/07/14_09:54:56 debug: /etc/init.d/killnfs
start done. RC=0
Delay[3515]: 2008/07/14_09:54:59 INFO: Delay is running OK
Delay[3504]: 2008/07/14_09:54:59 INFO: Running OK
ResourceManager[2873]: 2008/07/14_09:54:59 info: Running
/etc/init.d/nfs start
ResourceManager[2873]: 2008/07/14_09:54:59 debug: Starting
/etc/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
Starting RPC idmapd: [ OK ]
ResourceManager[2873]: 2008/07/14_09:55:00 debug: /etc/init.d/nfs
start done. RC=0
ResourceManager[2873]: 2008/07/14_09:55:00 info: Running
/etc/init.d/nfslock start
ResourceManager[2873]: 2008/07/14_09:55:00 debug: Starting
/etc/init.d/nfslock start
Starting NFS statd: [ OK ]
ResourceManager[2873]: 2008/07/14_09:55:00 debug: /etc/init.d/nfslock
start done. RC=0
heartbeat[2847]: 2008/07/14_09:55:00 info: all HA resource acquisition
completed (standby).
heartbeat[2542]: 2008/07/14_09:55:00 info: Standby resource acquisition
done [all].
heartbeat[3753]: 2008/07/14_09:55:00 debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
harc[3753]: 2008/07/14_09:55:00 info: Running /etc/ha.d/rc.d/status
status
mach_down[3769]: 2008/07/14_09:55:00 info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources
acquired
mach_down[3769]: 2008/07/14_09:55:00 info: mach_down takeover
complete for node watchdog-client2.
heartbeat[2542]: 2008/07/14_09:55:00 info: mach_down takeover complete.
heartbeat[3803]: 2008/07/14_09:55:00 debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
harc[3803]: 2008/07/14_09:55:00 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[3803]: 2008/07/14_09:55:00 received ip-request-resp
IPaddr::10.0.38.71/24/eth0 OK yes
ResourceManager[3824]: 2008/07/14_09:55:00 info: Acquiring resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock
IPaddr[3851]: 2008/07/14_09:55:00 INFO: Running OK
Delay[3941]: 2008/07/14_09:55:03 INFO: Delay is running OK
Delay[3930]: 2008/07/14_09:55:03 INFO: Running OK
Filesystem[3978]: 2008/07/14_09:55:03 INFO: Resource is stopped
ResourceManager[3824]: 2008/07/14_09:55:03 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[4059]: 2008/07/14_09:55:03 INFO: Running start for
/dev/drbd1 on //mnt/data
mount: /dev/drbd1 already mounted or //mnt/data busy
mount: according to mtab, /dev/drbd1 is mounted on /mnt/data
Filesystem[4059]: 2008/07/14_09:55:03 ERROR: Couldn't mount
filesystem /dev/drbd1 on //mnt/data
Filesystem[4048]: 2008/07/14_09:55:03 ERROR: Generic error
ERROR: Generic error
ResourceManager[3824]: 2008/07/14_09:55:03 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done.
RC=1
ResourceManager[3824]: 2008/07/14_09:55:03 ERROR: Return code 1 from
/etc/ha.d/resource.d/Filesystem
ResourceManager[3824]: 2008/07/14_09:55:03 CRIT: Giving up resources
due to failure of Filesystem::/dev/drbd1:://mnt/data::ext3
ResourceManager[3824]: 2008/07/14_09:55:03 info: Releasing resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock
ResourceManager[3824]: 2008/07/14_09:55:03 info: Running
/etc/init.d/nfslock stop
ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting
/etc/init.d/nfslock stop
Stopping NFS locking: [ OK ]
Stopping NFS statd: [ OK ]
ResourceManager[3824]: 2008/07/14_09:55:03 debug: /etc/init.d/nfslock
stop done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:03 info: Running
/etc/init.d/nfs stop
ResourceManager[3824]: 2008/07/14_09:55:03 debug: Starting
/etc/init.d/nfs stop
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
ResourceManager[3824]: 2008/07/14_09:55:04 debug: /etc/init.d/nfs stop
done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:04 info: Running
/etc/ha.d/resource.d/Delay 3 0 stop
ResourceManager[3824]: 2008/07/14_09:55:04 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 stop
Delay[4249]: 2008/07/14_09:55:04 INFO: Success
INFO: Success
ResourceManager[3824]: 2008/07/14_09:55:04 debug:
/etc/ha.d/resource.d/Delay 3 0 stop done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:04 info: Running
/etc/init.d/killnfs stop
ResourceManager[3824]: 2008/07/14_09:55:04 debug: Starting
/etc/init.d/killnfs stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug: /etc/init.d/killnfs
stop done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop
Filesystem[4343]: 2008/07/14_09:55:05 INFO: Running stop for
/dev/drbd1 on //mnt/data
Filesystem[4332]: 2008/07/14_09:55:05 INFO: Success
INFO: Success
ResourceManager[3824]: 2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop done.
RC=0
ResourceManager[3824]: 2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/Delay 3 0 stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 stop
Delay[4426]: 2008/07/14_09:55:05 INFO: Delay already stopped.
Delay[4415]: 2008/07/14_09:55:05 INFO: Success
INFO: Success
ResourceManager[3824]: 2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/Delay 3 0 stop done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/drbddisk r0 stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/drbddisk r0 stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/drbddisk r0 stop done. RC=0
ResourceManager[3824]: 2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop
ResourceManager[3824]: 2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop
In IP Stop
SIOCDELRT: No such process
IPaddr[4522]: 2008/07/14_09:55:05 INFO: ifconfig eth0:0 down
IPaddr[4493]: 2008/07/14_09:55:05 INFO: Success
INFO: Success
ResourceManager[3824]: 2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop done. RC=0
heartbeat[2542]: 2008/07/14_09:55:09 WARN: node watchdog-client2: is
dead
heartbeat[2542]: 2008/07/14_09:55:09 info: Dead node watchdog-client2
gave up resources.
heartbeat[2542]: 2008/07/14_09:55:09 info: Link watchdog-client2:eth0
dead.
hb_standby[4564]: 2008/07/14_09:55:35 Going standby [foreign].
heartbeat[2542]: 2008/07/14_09:55:35 info: watchdog-client1 wants to go
standby [foreign]
heartbeat[2542]: 2008/07/14_09:55:45 WARN: No reply to standby request.
Standby request cancelled.
[EMAIL PROTECTED] log]#
Any idea why it's trying to mount the file system twice? I think that's
what's causing the problem.
Also, on a reboot, shouldn't client1 become the secondary and client 2
stay as the primary node?
If anyone needs any other information, please let me know. I'm trying to
fight this problem un-successfully for the past 2-3 days now.
Thanks,
Nikhil
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems