Following various pieces of documentation I found on the web, I have 
configured a two-node HA/DRBD/NFS setup. Almost everything works without 
any issues. It is running on SLES 10; here are the conf files:

cat /etc/ha.d/ha.cf
auto_failback on
node ulccnfs01
node ulccnfs02
ucast eth1 10.1.66.110
udpport 1337
keepalive 2
#deadtime 30
deadtime 15
warntime 10
initdead 120
logfacility local0
debugfile /var/log/ha-log
logfile /var/log/ha-log
#auto_failback off

cat /etc/ha.d/haresources
ulccnfs01 10.1.100.140 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/images::ext3 nfsserver

cat /etc/drbd.conf
resource drbd-resource-0 {
  protocol C;
  incon-degr-cmd "halt -f"; # killall heartbeat would be a good alternative :->

  startup {
        degr-wfc-timeout 10;    # 10 seconds
  }

  disk {
    on-io-error detach;
  }

  syncer {
    rate 10M; # Note: 'M' is MegaBytes, not MegaBits
  }

  on ulccnfs01 {
    device    /dev/drbd0;
    disk      /dev/cciss/c0d1p1;
    address   10.1.100.135:7789;
    meta-disk  internal;
  }

  on ulccnfs02 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.1.100.170:7789;
    meta-disk  internal;
  }
}

Failover works without an issue: I can shut down ulccnfs01 and ulccnfs02 
grabs the resources without blinking. However, when ulccnfs01 comes back up 
it attempts to grab the resources and ulccnfs02 lets them go, but the start 
on ulccnfs01 fails and ulccnfs02 does not pick them back up. Here is where 
it gets strange: if I run /etc/init.d/heartbeat restart immediately after 
the reboot, everything works.

Here are relevant pieces of the debug log:

heartbeat: 2007/06/19_10:01:08 info: Sending Gratuitous Arp for 10.1.100.140 on eth0:0 [eth0]
heartbeat: 2007/06/19_10:01:08 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-10.1.100.140 eth0 10.1.100.140 auto 10.1.100.140 ffffffffffff
heartbeat: 2007/06/19_10:01:08 debug: /etc/ha.d/resource.d/IPaddr 10.1.100.140 start done. RC=0
heartbeat: 2007/06/19_10:01:08 info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 start
heartbeat: 2007/06/19_10:01:08 debug: Starting /etc/ha.d/resource.d/drbddisk drbd-resource-0 start
heartbeat: 2007/06/19_10:01:09 debug: /etc/ha.d/resource.d/drbddisk drbd-resource-0 start done. RC=0
heartbeat: 2007/06/19_10:01:09 info: Running /etc/init.d/nfsserver  start
heartbeat: 2007/06/19_10:01:09 debug: Starting /etc/init.d/nfsserver start
Starting kernel based NFS serverexportfs: can't open /var/lib/nfs/rmtab for reading
exportfs: could not open /var/lib/nfs/etab for locking
exportfs: can't lock /var/lib/nfs/etab for writing
/usr/sbin/rpc.nfsd: chdir(/var/lib/nfs) failed: No such file or directory
/usr/sbin/rpc.mountd: chdir(/var/lib/nfs) failed: No such file or directory
..failed
heartbeat: 2007/06/19_10:01:09 debug: /etc/init.d/nfsserver  start done. RC=7
heartbeat: 2007/06/19_10:01:09 ERROR: Return code 7 from /etc/init.d/nfsserver
heartbeat: 2007/06/19_10:01:09 CRIT: Giving up resources due to failure of nfsserver
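
For anyone digging through a longer ha-log, the failure lines can be 
separated from the debug noise with a small filter. This is only a sketch: 
ha_failures is a made-up helper name, not something heartbeat ships, and it 
assumes the "ERROR:" / "CRIT:" tags appear as they do in the excerpt above.

```shell
# Hypothetical helper (not part of heartbeat): print only the ERROR and
# CRIT lines from heartbeat log text supplied on standard input.
ha_failures() {
    grep -E ' (ERROR|CRIT): '
}

# Example use against the logfile named in ha.cf:
#   ha_failures < /var/log/ha-log
```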

/var/lib/nfs is a symlink to a directory on the DRBD device, so that there 
is no hiccup on the servers mounting the share. I am wondering whether a 
sleep command in haresources might give DRBD enough time to mount its 
partition, if that is the problem, or whether anything else might help. The 
other concern is that the log says it is giving up the resources, but 
ulccnfs02 never takes over from this point. Any and all help will be 
appreciated.
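
One way to test the timing theory without a fixed sleep is to poll until 
/var/lib/nfs actually resolves to a directory, which it cannot do while the 
link target on the DRBD device is missing. A minimal sketch, assuming a 
plain POSIX shell; wait_for_dir is a hypothetical helper name and the 
default of 10 tries is arbitrary. It only diagnoses the condition and does 
not by itself say where in the resource chain the mount should happen.

```shell
# Hypothetical helper: wait until a path resolves to an existing
# directory, checking once per second for a limited number of tries.
# Usage: wait_for_dir PATH [TRIES]
wait_for_dir() {
    dir=$1
    tries=${2:-10}          # assumed default; tune as needed
    while [ "$tries" -gt 0 ]; do
        # [ -d ] follows symlinks, so a dangling link is "not ready"
        if [ -d "$dir" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example use before starting the NFS server:
#   wait_for_dir /var/lib/nfs 30 || echo "/var/lib/nfs still missing"
```

If the directory never appears within the timeout, the delay is probably 
not the cause and the filesystem is simply not being mounted before 
nfsserver starts.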

The second issue is that after rebooting, DRBD ends up in Primary/Unknown 
and Secondary/Unknown. I know that this isn't the DRBD list but I thought 
someone here might be able to give some advice.
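
To capture that state on each node after a reboot, the connection state and 
roles can be pulled out of /proc/drbd. This is a sketch under an assumption: 
it expects status lines of the form " 0: cs:... st:..." and other DRBD 
releases may label the fields differently. drbd_state is a hypothetical 
helper name.

```shell
# Hypothetical helper: read /proc/drbd-style text on standard input and
# print "<minor> <connection-state> <local-role>/<peer-role>" per device.
# Assumes lines such as " 0: cs:WFConnection st:Primary/Unknown ...".
drbd_state() {
    sed -n 's/^ *\([0-9][0-9]*\): cs:\([^ ]*\) st:\([^ ]*\).*/\1 \2 \3/p'
}

# Example use on a node:
#   drbd_state < /proc/drbd
```

A connection state other than Connected on both nodes would suggest the two 
sides never re-established the replication link after the reboot.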
