[Linux-HA] Heartbeat + drbd + NFS: secondary node (after being primary) goes nuts

Paolo Pisati Tue, 20 Oct 2009 09:01:29 -0700

Dear guys,

I have a small problem with an NFS/DRBD/Heartbeat cluster: when the 
primary comes back up, the secondary node (which had been promoted to 
primary in the meantime) is unable to release the resources (IP/DRBD) 
gracefully, and reboots.
I know there's resource stickiness with Pacemaker, but I couldn't find 
any info for plain Heartbeat.
Moreover, is the behavior of node2 normal in my case? (Config files are 
below.)


But let's start from the beginning: I set up a cluster following (more 
or less) this guide 
(http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat)
on a CentOS system.

# node1 is primary and works OK
[p...@node1 ~]$ /sbin/ifconfig eth0:0
eth0:0    Link encap:Ethernet  HWaddr 00:25:64:3A:DE:68 
          inet addr:172.16.6.77  Bcast:172.16.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:dfdf0000-dfe00000
[p...@node1 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by 
[email protected], 2009-08-22 13:26:57
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:20484 nr:0 dw:136 dr:20865 al:5 bm:6 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:1279 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/257 hits:29 misses:5 starving:0 dirty:0 changed:5

# while node2 is secondary, waiting to take over
[p...@node2 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by 
[email protected], 2009-08-22 13:26:57
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:20484 dw:20484 dr:0 al:0 bm:5 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:1279 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

# to simulate a disaster, I pull the power cord from node1, 
and node2 takes over
[p...@node2 ~]$ /sbin/ifconfig eth0:0
eth0:0    Link encap:Ethernet  HWaddr 00:25:64:3A:E3:15 
          inet addr:172.16.6.77  Bcast:172.16.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:dfdf0000-dfe00000

[p...@node2 ~]$ cat /proc/drbd
version: 8.0.16 (api:86/proto:86)
GIT-hash: d30881451c988619e243d6294a899139eed1183d build by 
[email protected], 2009-08-22 13:26:57
 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
    ns:0 nr:72 dw:164 dr:381 al:4 bm:2 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:19 misses:4 starving:0 dirty:0 changed:4
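
In case someone wants to check these states from a script instead of 
reading them by eye, here is a minimal sketch that pulls the connection 
state, roles and disk states out of /proc/drbd. It assumes the 8.0.x 
layout shown above (cs:/st:/ds: fields); the function name and the dict 
keys are my own invention, not anything DRBD provides.

```python
import re

# Matches the per-device status line, e.g.
#  " 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---"
# Assumption: DRBD 8.0.x field names (cs, st, ds) as in the output above.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per DRBD minor found in the given /proc/drbd text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if not match:
            continue  # version, GIT-hash and counter lines are skipped
        local_role, peer_role = match.group("st").split("/")
        local_disk, peer_disk = match.group("ds").split("/")
        devices.append({
            "minor": int(match.group("minor")),
            "connection": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return devices
```

On a live node this would be fed with open("/proc/drbd").read(); for the 
output above it should report node2 as Primary with the peer Unknown.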

# HERE COMES THE PROBLEM: at this point node1 recovers and becomes
# the primary again, BUT the secondary complains loudly

[/var/log/messages]
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Releasing resource 
group: node1.cluster.nfs.contactlab.lan IPaddr::172.16.6.77/24/eth0 
drbddisk::nfsdata Filesystem::/dev/drbd0::/data::ext3 nfslock nfs
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running 
/etc/init.d/nfs  stop
Oct 20 17:44:53 node2 mountd[3630]: Caught signal 15, un-registering and 
exiting.
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running 
/etc/init.d/nfslock  stop
Oct 20 17:44:53 node2 rpc.statd[3520]: Caught signal 15, un-registering 
and exiting.
Oct 20 17:44:53 node2 portmap[3914]: connect from 127.0.0.1 to 
unset(status): request from unprivileged port
Oct 20 17:44:53 node2 ResourceManager[3809]: info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: Running stop for 
/dev/drbd0 on /data
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: Trying to unmount /data
Oct 20 17:44:53 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGTERM
Oct 20 17:44:53 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:54 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGTERM
Oct 20 17:44:54 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:55 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGTERM
Oct 20 17:44:55 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:56 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGKILL
Oct 20 17:44:56 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:57 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGKILL
Oct 20 17:44:57 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:58 node2 Filesystem[3953]: ERROR: Couldn't unmount /data; 
trying cleanup with SIGKILL
Oct 20 17:44:58 node2 Filesystem[3953]: INFO: No processes on /data were 
signalled
Oct 20 17:44:59 node2 Filesystem[3953]: ERROR: Couldn't unmount /data, 
giving up!
Oct 20 17:44:59 node2 Filesystem[3942]: ERROR:  Generic error
[snip]
Oct 20 17:46:12 node2 Filesystem[5844]: ERROR: Couldn't unmount /data, 
giving up!
Oct 20 17:46:12 node2 Filesystem[5833]: ERROR:  Generic error
Oct 20 17:46:12 node2 ResourceManager[3809]: ERROR: Return code 1 from 
/etc/ha.d/resource.d/Filesystem
Oct 20 17:46:12 node2 Filesystem[6006]: INFO:  Running OK
Oct 20 17:46:12 node2 ResourceManager[3809]: CRIT: Resource STOP 
failure. Reboot required!
Oct 20 17:46:12 node2 ResourceManager[3809]: CRIT: Killing heartbeat 
ungracefully!
Oct 20 17:46:12 node2 kernel: md: stopping all md devices.
Oct 20 17:46:13 node2 kernel: Synchronizing SCSI cache for disk sda:
Oct 20 17:46:13 node2 kernel: usb 4-1: new full speed USB device using 
uhci_hcd and address 2
Oct 20 17:46:13 node2 kernel: usb 4-1: not running at top speed; connect 
to a high speed hub
Oct 20 17:46:13 node2 kernel: usb 4-1: configuration #1 chosen from 1 choice
Oct 20 17:46:13 node2 kernel: hub 4-1:1.0: USB hub found
Oct 20 17:46:13 node2 kernel: hub 4-1:1.0: 4 ports detected
Oct 20 17:46:14 node2 kernel: ACPI: PCI interrupt for device 
0000:02:00.1 disabled
Oct 20 17:46:14 node2 kernel: ACPI: PCI interrupt for device 
0000:02:00.0 disabled
Oct 20 17:48:30 node2 syslogd 1.4.1: restart.
Oct 20 17:48:30 node2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 20 17:48:30 node2 kernel: Linux version 2.6.18-164.el5 
([email protected]) (gcc version 4.1.2 20080704 (Red Hat 
4.1.2-46)) #1 SMP Thu Sep 3 03:28:30 EDT 2009
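
Since the Filesystem script keeps saying "No processes on /data were 
signalled", one way to double-check that from user space is to walk 
/proc and look for processes whose working directory or open file 
descriptors point under the mount point. This is only a rough sketch of 
that idea (the function name and the proc_root parameter are 
hypothetical, and it needs root to see other users' processes). It only 
sees user-space references, so a holder inside the kernel would not 
show up here.

```python
import os


def holders(mount_point, proc_root="/proc"):
    """Return {pid: [paths]} for processes with cwd or open fds under mount_point."""
    base = mount_point.rstrip("/")
    prefix = base + "/"
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        pid_dir = os.path.join(proc_root, entry)
        links = [os.path.join(pid_dir, "cwd")]
        fd_dir = os.path.join(pid_dir, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process already gone, or permission denied
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # unreadable or vanished link
            if target == base or target.startswith(prefix):
                found.setdefault(int(entry), []).append(target)
    return found
```

Calling holders("/data") on node2 while the unmount is failing should 
return an empty dict if the log above is telling the truth.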

/etc/drbd.conf:
--------------
global { minor-count 1; }
resource nfsdata {
  protocol C;

  on node1 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.0.2:7788;
    meta-disk internal;
  }

  disk {
    on-io-error detach;
  }
  net {
    max-buffers 2048; #datablock buffers used before writing to disk.
    ko-count 4; # Peer is dead if this count is exceeded.
    #on-disconnect reconnect; # Peer disconnected, try to reconnect.
  }
  syncer {
    rate 10M; # Synchronization rate, in megabytes per second. Good for 100Mb network.
    #group 1;  # Used for grouping resources, parallel sync.
    al-extents 257; # Must be prime, number of active sets.
  }
  startup {
    wfc-timeout 0; degr-wfc-timeout 120;
  }
}

/etc/ha.d/ha.cf:
---------------
logfacility     local0
keepalive 2
deadtime 10
bcast   eth1
node node1 node2

/etc/ha.d/haresources:
---------------------
gnode1 IPaddr::172.16.6.77/24/eth0 drbddisk::nfsdata 
Filesystem::/dev/drbd0::/data::ext3 nfslock nfs
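
To make it clearer how this line relates to the log above: as far as I 
understand it, the first field is the preferred node, each following 
field is a resource script with its arguments separated by "::", and on 
release the resources are stopped right to left (which matches the 
nfs, nfslock, Filesystem order in the log). A small sketch of that 
expansion follows; the function names are mine, and it assumes the whole 
entry sits on a single line.

```python
def parse_haresources_line(line):
    """Split a haresources entry into (node, [(script, [args]), ...])."""
    fields = line.split()
    node, specs = fields[0], fields[1:]
    resources = []
    for spec in specs:
        # "Filesystem::/dev/drbd0::/data::ext3" -> name plus three arguments
        name, *args = spec.split("::")
        resources.append((name, args))
    return node, resources


def stop_commands(resources):
    """Resource scripts with arguments, in the order they are stopped on release."""
    return [" ".join([name, *args, "stop"]) for name, args in reversed(resources)]
```

For the entry above, the third stop command comes out as 
"Filesystem /dev/drbd0 /data ext3 stop", which is the call that fails in 
the log.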

/etc/ha.d/authkeys:
------------------
auth 3
3 md5 f00b4r



