Following various sets of documentation I found on the web, I have
configured a two-node HA/DRBD/NFS setup. Almost everything works without
any issues. It runs on SLES 10; here are the conf files:
cat /etc/ha.d/ha.cf
auto_failback on
node ulccnfs01
node ulccnfs02
ucast eth1 10.1.66.110
udpport 1337
keepalive 2
#deadtime 30
deadtime 15
warntime 10
initdead 120
logfacility local0
debugfile /var/log/ha-log
logfile /var/log/ha-log
#auto_failback off
cat /etc/ha.d/haresources
ulccnfs01 10.1.100.140 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/images::ext3 nfsserver
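For reading the haresources line: as I understand heartbeat's v1 format, each line is one resource group, the first field is the node that normally owns it, and the resources are started left to right and stopped in reverse order. A bare IP address stands for the IPaddr resource (the log below shows /etc/ha.d/resource.d/IPaddr being run for 10.1.100.140), and "::" separates a resource script's arguments. The sketch here is a hypothetical helper, haresources_order, that prints each group's start order so it can be compared with the "Running ..." lines in the debug log. It assumes one group per physical line and plain IPv4 addresses.

```shell
#!/bin/bash
# Hypothetical helper: print each heartbeat v1 resource group from a
# haresources file in start order. Assumes one group per physical line:
# node name first, then resources in the order heartbeat starts them.
haresources_order() {
  awk '
    NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and comments
    {
      printf "node %s:", $1
      for (i = 2; i <= NF; i++) {
        r = $i
        # a bare IPv4 address is shorthand for the IPaddr resource
        if (r ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) r = "IPaddr::" r
        gsub(/::/, " ", r)             # "::" separates script arguments
        printf " [%s]", r
      }
      printf "\n"
    }
  ' "${1:-/etc/ha.d/haresources}"
}
```

Run it as haresources_order /etc/ha.d/haresources. If the resource line were actually split across two physical lines in the file, the sketch would print the second line as a separate group whose "node" is Filesystem::/dev/drbd0::/images::ext3.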
cat /etc/drbd.conf
resource drbd-resource-0 {
protocol C;
incon-degr-cmd "halt -f"; # killall heartbeat would be a good alternative :->
startup {
degr-wfc-timeout 10; # 10 seconds
}
disk {
on-io-error detach;
}
syncer {
rate 10M; # Note: 'M' is MegaBytes, not MegaBits
}
on ulccnfs01 {
device /dev/drbd0;
disk /dev/cciss/c0d1p1;
address 10.1.100.135:7789;
meta-disk internal;
}
on ulccnfs02 {
device /dev/drbd0;
disk /dev/sdb1;
address 10.1.100.170:7789;
meta-disk internal;
}
}
Failover works without an issue: I can shut down ulccnfs01 and ulccnfs02
grabs the resources without blinking. However, when ulccnfs01 comes back
up, it attempts to take the resources back; ulccnfs02 releases them, but
they fail to start, and ulccnfs02 does not pick them back up. Here is where
it gets strange: immediately after the reboot, I can run
/etc/init.d/heartbeat restart and everything works.
Here are relevant pieces of the debug log:
heartbeat: 2007/06/19_10:01:08 info: Sending Gratuitous Arp for 10.1.100.140 on eth0:0 [eth0]
heartbeat: 2007/06/19_10:01:08 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-10.1.100.140 eth0 10.1.100.140 auto 10.1.100.140 ffffffffffff
heartbeat: 2007/06/19_10:01:08 debug: /etc/ha.d/resource.d/IPaddr 10.1.100.140 start done. RC=0
heartbeat: 2007/06/19_10:01:08 info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 start
heartbeat: 2007/06/19_10:01:08 debug: Starting /etc/ha.d/resource.d/drbddisk drbd-resource-0 start
heartbeat: 2007/06/19_10:01:09 debug: /etc/ha.d/resource.d/drbddisk drbd-resource-0 start done. RC=0
heartbeat: 2007/06/19_10:01:09 info: Running /etc/init.d/nfsserver start
heartbeat: 2007/06/19_10:01:09 debug: Starting /etc/init.d/nfsserver start
Starting kernel based NFS serverexportfs: can't open /var/lib/nfs/rmtab for reading
exportfs: could not open /var/lib/nfs/etab for locking
exportfs: can't lock /var/lib/nfs/etab for writing
/usr/sbin/rpc.nfsd: chdir(/var/lib/nfs) failed: No such file or directory
/usr/sbin/rpc.mountd: chdir(/var/lib/nfs) failed: No such file or directory
..failed
heartbeat: 2007/06/19_10:01:09 debug: /etc/init.d/nfsserver start done. RC=7
heartbeat: 2007/06/19_10:01:09 ERROR: Return code 7 from /etc/init.d/nfsserver
heartbeat: 2007/06/19_10:01:09 CRIT: Giving up resources due to failure of nfsserver
I am using a symlinked /var/lib/nfs that points to a directory on the DRBD
device, so that the servers mounting the share see no hiccup on failover. I
am wondering whether adding a sleep command to haresources might give the
DRBD partition enough time to mount, if that is the problem, or whether
anything else might help. My other concern is that heartbeat says it is
giving up the resources, but ulccnfs02 never takes over from that point.
Any and all help will be appreciated.
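On the sleep idea: a fixed sleep can only guess at timing. One alternative, sketched here under assumptions, is a bounded poll that waits until /var/lib/nfs actually resolves to a directory before nfsserver is started. The function name wait_for_dir and the 30-second default are hypothetical, and the symlink target is assumed to live on the DRBD-backed /images filesystem. Because [ -d ] follows symlinks, a dangling /var/lib/nfs link fails the test until the target filesystem is mounted.

```shell
#!/bin/bash
# Hypothetical helper: wait until PATH resolves to an existing directory,
# checking once per second, and give up after TIMEOUT seconds.
wait_for_dir() {
  local path="$1" timeout="${2:-30}" waited=0
  # [ -d ] follows symlinks, so a dangling /var/lib/nfs link fails here
  until [ -d "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "wait_for_dir: $path not available after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example, a hypothetical wrapper could run wait_for_dir /var/lib/nfs 30 && /etc/init.d/nfsserver start, so that nfsserver is only attempted once the link resolves.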
The second issue is that after rebooting, DRBD ends up in Primary/Unknown
on one node and Secondary/Unknown on the other. I know this isn't the DRBD
list, but I thought someone here might be able to give some advice.
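To see which state each node is actually in, /proc/drbd can be read directly. The helpers here, drbd_states and drbd_peer_unknown, are hypothetical names. They assume the DRBD 0.7-style format (which I believe is what the incon-degr-cmd option in the config above implies), where each device line looks like " 0: cs:Connected st:Primary/Secondary ld:Consistent". drbd_states prints the connection state and the local/peer roles. A peer role of Unknown, as in Primary/Unknown, means the node is not currently connected to its peer.

```shell
#!/bin/bash
# Hypothetical helper: print "device cstate role" for each DRBD minor in a
# /proc/drbd-style file. Assumes device lines start with "N:" and carry
# "cs:" and "st:" fields; later DRBD releases use "ro:" for the role,
# so both prefixes are accepted.
drbd_states() {
  awk '
    $1 ~ /^[0-9]+:$/ {
      dev = $1; sub(/:$/, "", dev)
      cs = "?"; role = "?"
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^cs:/) cs = substr($i, 4)
        else if ($i ~ /^(st|ro):/) role = substr($i, 4)
      }
      print dev, cs, role
    }
  ' "${1:-/proc/drbd}"
}

# Exit 0 if any device reports its peer role as Unknown (not connected).
drbd_peer_unknown() {
  drbd_states "$@" | awk '$3 ~ /\/Unknown$/ { found = 1 } END { exit !found }'
}
```

Run without arguments, both helpers read /proc/drbd on the local node; running them on each node after a reboot shows whether the pair ever reconnected.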
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems