[Linux-HA] Weird heartbeat problem.

Nikhil Kulkarni Mon, 14 Jul 2008 10:59:57 -0700

Hi,


I have a weird heartbeat problem and am hoping that someone on this
email list can point out something that I am doing wrong.

 

I have drbd+heartbeat+nfs running on two systems, both installed with
CentOS 5.1.

 

My ha.cf is:

debugfile /var/log/ha-debug.log
logfile /var/log/ha.log
logfacility local0
keepalive 2
warntime 10
deadtime 20
initdead 75
bcast eth0
auto_failback off
node watchdog-client1
node watchdog-client2

 

My drbd.conf is:

global {
  usage-count yes;
}
common {
  protocol C;
}
resource r0 {
  startup {
    wfc-timeout 0;
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  net {
    timeout           60;
    connect-int       10;
    ping-int          10;
    max-buffers     4096;
    max-epoch-size  4096;
  }
  syncer {
    rate 500M;
    al-extents 257;
  }
  on watchdog-client1 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.38.40:7789;
    meta-disk /dev/sda6[0];
  }
  on watchdog-client2 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.38.43:7789;
    meta-disk /dev/sda6[0];
  }
}

 

This is my haresources file:

watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock
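
To make the ordering easier to see, here is a rough Python sketch that splits that entry into the individual resource scripts and their arguments. It is only an illustration: the entry string is copied from the "Acquiring resource group" line in the logs further down, the function name parse_haresources_line is made up, and the "::" splitting is inferred from the commands that ResourceManager logs, not taken from heartbeat's own parser.

```python
def parse_haresources_line(line):
    """Split one haresources entry into (node, [(script, [args]), ...]).

    Assumption: resources are whitespace-separated and each resource is
    'script::arg1::arg2...', as the ResourceManager log lines suggest.
    """
    fields = line.split()
    node, resources = fields[0], fields[1:]
    parsed = []
    for resource in resources:
        script, *args = resource.split("::")
        parsed.append((script, args))
    return node, parsed


# Entry as it appears in the "Acquiring resource group" log line.
entry = (
    "watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 "
    "Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock"
)

node, resources = parse_haresources_line(entry)
print("preferred node:", node)
for script, args in resources:
    # Resources are started left to right and stopped in reverse order.
    print(script, *args, "start")
```

The Filesystem resource comes out as the script name plus the three arguments /dev/drbd1, //mnt/data and ext3, which matches the "Filesystem /dev/drbd1 //mnt/data ext3 start" command in the logs.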

 

watchdog-client1 is the initial Primary node and client2 is the
secondary node.

 

When I shut down client1, client2 becomes the primary and mounts the
file system as expected.

Up to this point everything works like a charm.

 

Now, when I power client1 back on, it never becomes the secondary, and
the file system on client2 for some reason gets unmounted and is never
mounted again.

 

The logs on client1 show that it tries to mount the file system twice
(see the two "Running start for /dev/drbd1" entries below, at 09:54:55
and 09:55:03). The second attempt fails because the first one already
mounted the file system successfully.

Because the second mount fails, heartbeat gives up and releases the
whole resource group.
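
In case it helps anyone reproduce the analysis, here is a small Python sketch that flags any resource command that is started again without a stop in between. It assumes the unwrapped one-entry-per-line format of /var/log/ha.log (the entries below are wrapped by my mail client); the sample lines are copied from the client1 log, and the function name find_repeated_starts is just something I made up for this illustration.

```python
import re

# Matches ResourceManager entries such as:
#   ResourceManager[2873]:  2008/07/14_09:54:59 info: Running /etc/init.d/nfs  start
RUN_RE = re.compile(
    r"ResourceManager\[(\d+)\]:\s+(\S+) info: Running (\S+)\s+(.*?)\s*(start|stop)\s*$"
)


def find_repeated_starts(lines):
    """Return (command, first_start, repeated_start) for every start that is
    issued while an earlier start of the same command is still in effect."""
    active = {}    # command -> (timestamp, pid) of the start still in effect
    repeated = []
    for line in lines:
        match = RUN_RE.search(line)
        if not match:
            continue
        pid, stamp, script, args, action = match.groups()
        command = f"{script} {args}".strip()
        if action == "start":
            if command in active:
                repeated.append((command, active[command], (stamp, pid)))
            else:
                active[command] = (stamp, pid)
        else:
            # A stop clears the command so a later start is not flagged.
            active.pop(command, None)
    return repeated


SAMPLE_LOG = """\
ResourceManager[2873]:  2008/07/14_09:54:48 info: Running /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start
ResourceManager[2873]:  2008/07/14_09:54:55 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[2873]:  2008/07/14_09:54:59 info: Running /etc/init.d/nfs  start
ResourceManager[3824]:  2008/07/14_09:55:03 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
"""

repeated = find_repeated_starts(SAMPLE_LOG.splitlines())
for command, first, second in repeated:
    print(f"{command}: started at {first[0]} (pid {first[1]}), "
          f"again at {second[0]} (pid {second[1]})")
```

On these sample lines only the Filesystem start is flagged, issued first by ResourceManager[2873] and then again by ResourceManager[3824].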

 

Below are the logs from client1:

heartbeat[2542]: 2008/07/14_09:54:48 info: Received shutdown notice from
'watchdog-client2'.

heartbeat[2542]: 2008/07/14_09:54:48 info: Resources being acquired from
watchdog-client2.

heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 1

heartbeat[2847]: 2008/07/14_09:54:48 info: acquire all HA resources
(standby).

ResourceManager[2873]:  2008/07/14_09:54:48 info: Acquiring resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock

IPaddr[2915]:   2008/07/14_09:54:48 INFO:  Resource is stopped

IPaddr[2929]:   2008/07/14_09:54:48 INFO:  Resource is stopped

heartbeat[2848]: 2008/07/14_09:54:48 info: Local Resource acquisition
completed.

heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 2

heartbeat[2542]: 2008/07/14_09:54:48 debug: StartNextRemoteRscReq():
child count 1

ResourceManager[2873]:  2008/07/14_09:54:48 info: Running
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start

ResourceManager[2873]:  2008/07/14_09:54:48 debug: Starting
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start

IPaddr[3072]:   2008/07/14_09:54:48 INFO: Using calculated netmask for
10.0.38.71: 255.255.255.0

IPaddr[3072]:   2008/07/14_09:54:48 DEBUG: Using calculated broadcast
for 10.0.38.71: 10.0.38.255

IPaddr[3072]:   2008/07/14_09:54:48 INFO: eval ifconfig eth0:0
10.0.38.71 netmask 255.255.255.0 broadcast 10.0.38.255

IPaddr[3072]:   2008/07/14_09:54:48 DEBUG: Sending Gratuitous Arp for
10.0.38.71 on eth0:0 [eth0]

IPaddr[3043]:   2008/07/14_09:54:48 INFO:  Success

INFO:  Success

ResourceManager[2873]:  2008/07/14_09:54:48 debug:
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start done. RC=0

ResourceManager[2873]:  2008/07/14_09:54:48 info: Running
/etc/ha.d/resource.d/drbddisk r0 start

ResourceManager[2873]:  2008/07/14_09:54:48 debug: Starting
/etc/ha.d/resource.d/drbddisk r0 start

ResourceManager[2873]:  2008/07/14_09:54:48 debug:
/etc/ha.d/resource.d/drbddisk r0 start done. RC=0

Delay[3231]:    2008/07/14_09:54:51 INFO: Delay is stopped

Delay[3220]:    2008/07/14_09:54:51 INFO:  Resource is stopped

ResourceManager[2873]:  2008/07/14_09:54:51 info: Running
/etc/ha.d/resource.d/Delay 3 0 start

ResourceManager[2873]:  2008/07/14_09:54:51 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 start

Delay[3278]:    2008/07/14_09:54:54 INFO:  Success

INFO:  Success

ResourceManager[2873]:  2008/07/14_09:54:54 debug:
/etc/ha.d/resource.d/Delay 3 0 start done. RC=0

Filesystem[3325]:       2008/07/14_09:54:54 INFO:  Resource is stopped

ResourceManager[2873]:  2008/07/14_09:54:55 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start

ResourceManager[2873]:  2008/07/14_09:54:55 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start

Filesystem[3406]:       2008/07/14_09:54:55 INFO: Running start for
/dev/drbd1 on //mnt/data

Filesystem[3395]:       2008/07/14_09:54:55 INFO:  Success

INFO:  Success

ResourceManager[2873]:  2008/07/14_09:54:55 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done.
RC=0

ResourceManager[2873]:  2008/07/14_09:54:55 info: Running
/etc/init.d/killnfs  start

ResourceManager[2873]:  2008/07/14_09:54:55 debug: Starting
/etc/init.d/killnfs  start

kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ...
or kill -l [sigspec]

ResourceManager[2873]:  2008/07/14_09:54:56 debug: /etc/init.d/killnfs
start done. RC=0

Delay[3515]:    2008/07/14_09:54:59 INFO: Delay is running OK

Delay[3504]:    2008/07/14_09:54:59 INFO:  Running OK

ResourceManager[2873]:  2008/07/14_09:54:59 info: Running
/etc/init.d/nfs  start

ResourceManager[2873]:  2008/07/14_09:54:59 debug: Starting
/etc/init.d/nfs  start

Starting NFS services:  [  OK  ]

Starting NFS quotas: [  OK  ]

Starting NFS daemon: [  OK  ]

Starting NFS mountd: [  OK  ]

Starting RPC idmapd: [  OK  ]

ResourceManager[2873]:  2008/07/14_09:55:00 debug: /etc/init.d/nfs
start done. RC=0

ResourceManager[2873]:  2008/07/14_09:55:00 info: Running
/etc/init.d/nfslock  start

ResourceManager[2873]:  2008/07/14_09:55:00 debug: Starting
/etc/init.d/nfslock  start

Starting NFS statd: [  OK  ]

ResourceManager[2873]:  2008/07/14_09:55:00 debug: /etc/init.d/nfslock
start done. RC=0

heartbeat[2847]: 2008/07/14_09:55:00 info: all HA resource acquisition
completed (standby).

heartbeat[2542]: 2008/07/14_09:55:00 info: Standby resource acquisition
done [all].

heartbeat[3753]: 2008/07/14_09:55:00 debug: notify_world: setting
SIGCHLD Handler to SIG_DFL

harc[3753]:     2008/07/14_09:55:00 info: Running /etc/ha.d/rc.d/status
status

mach_down[3769]:        2008/07/14_09:55:00 info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources
acquired

mach_down[3769]:        2008/07/14_09:55:00 info: mach_down takeover
complete for node watchdog-client2.

heartbeat[2542]: 2008/07/14_09:55:00 info: mach_down takeover complete.

heartbeat[3803]: 2008/07/14_09:55:00 debug: notify_world: setting
SIGCHLD Handler to SIG_DFL

harc[3803]:     2008/07/14_09:55:00 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp

ip-request-resp[3803]:  2008/07/14_09:55:00 received ip-request-resp
IPaddr::10.0.38.71/24/eth0 OK yes

ResourceManager[3824]:  2008/07/14_09:55:00 info: Acquiring resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock

IPaddr[3851]:   2008/07/14_09:55:00 INFO:  Running OK

Delay[3941]:    2008/07/14_09:55:03 INFO: Delay is running OK

Delay[3930]:    2008/07/14_09:55:03 INFO:  Running OK

Filesystem[3978]:       2008/07/14_09:55:03 INFO:  Resource is stopped

ResourceManager[3824]:  2008/07/14_09:55:03 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start

ResourceManager[3824]:  2008/07/14_09:55:03 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start

Filesystem[4059]:       2008/07/14_09:55:03 INFO: Running start for
/dev/drbd1 on //mnt/data

mount: /dev/drbd1 already mounted or //mnt/data busy

mount: according to mtab, /dev/drbd1 is mounted on /mnt/data

Filesystem[4059]:       2008/07/14_09:55:03 ERROR: Couldn't mount
filesystem /dev/drbd1 on //mnt/data

Filesystem[4048]:       2008/07/14_09:55:03 ERROR:  Generic error

ERROR:  Generic error

ResourceManager[3824]:  2008/07/14_09:55:03 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done.
RC=1

ResourceManager[3824]:  2008/07/14_09:55:03 ERROR: Return code 1 from
/etc/ha.d/resource.d/Filesystem

ResourceManager[3824]:  2008/07/14_09:55:03 CRIT: Giving up resources
due to failure of Filesystem::/dev/drbd1:://mnt/data::ext3

ResourceManager[3824]:  2008/07/14_09:55:03 info: Releasing resource
group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0
Delay::3::0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0
nfs nfslock

ResourceManager[3824]:  2008/07/14_09:55:03 info: Running
/etc/init.d/nfslock  stop

ResourceManager[3824]:  2008/07/14_09:55:03 debug: Starting
/etc/init.d/nfslock  stop

Stopping NFS locking: [  OK  ]

Stopping NFS statd: [  OK  ]

ResourceManager[3824]:  2008/07/14_09:55:03 debug: /etc/init.d/nfslock
stop done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:03 info: Running
/etc/init.d/nfs  stop

ResourceManager[3824]:  2008/07/14_09:55:03 debug: Starting
/etc/init.d/nfs  stop

Shutting down NFS mountd: [  OK  ]

Shutting down NFS daemon: [  OK  ]

Shutting down NFS quotas: [  OK  ]

Shutting down NFS services:  [  OK  ]

ResourceManager[3824]:  2008/07/14_09:55:04 debug: /etc/init.d/nfs  stop
done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:04 info: Running
/etc/ha.d/resource.d/Delay 3 0 stop

ResourceManager[3824]:  2008/07/14_09:55:04 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 stop

Delay[4249]:    2008/07/14_09:55:04 INFO:  Success

INFO:  Success

ResourceManager[3824]:  2008/07/14_09:55:04 debug:
/etc/ha.d/resource.d/Delay 3 0 stop done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:04 info: Running
/etc/init.d/killnfs  stop

ResourceManager[3824]:  2008/07/14_09:55:04 debug: Starting
/etc/init.d/killnfs  stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug: /etc/init.d/killnfs
stop done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop

Filesystem[4343]:       2008/07/14_09:55:05 INFO: Running stop for
/dev/drbd1 on //mnt/data

Filesystem[4332]:       2008/07/14_09:55:05 INFO:  Success

INFO:  Success

ResourceManager[3824]:  2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 stop done.
RC=0

ResourceManager[3824]:  2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/Delay 3 0 stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/Delay 3 0 stop

Delay[4426]:    2008/07/14_09:55:05 INFO: Delay already stopped.

Delay[4415]:    2008/07/14_09:55:05 INFO:  Success

INFO:  Success

ResourceManager[3824]:  2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/Delay 3 0 stop done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/drbddisk r0 stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/drbddisk r0 stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/drbddisk r0 stop done. RC=0

ResourceManager[3824]:  2008/07/14_09:55:05 info: Running
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop

ResourceManager[3824]:  2008/07/14_09:55:05 debug: Starting
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop

In IP Stop

SIOCDELRT: No such process

IPaddr[4522]:   2008/07/14_09:55:05 INFO: ifconfig eth0:0 down

IPaddr[4493]:   2008/07/14_09:55:05 INFO:  Success

INFO:  Success

ResourceManager[3824]:  2008/07/14_09:55:05 debug:
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 stop done. RC=0

heartbeat[2542]: 2008/07/14_09:55:09 WARN: node watchdog-client2: is
dead

heartbeat[2542]: 2008/07/14_09:55:09 info: Dead node watchdog-client2
gave up resources.

heartbeat[2542]: 2008/07/14_09:55:09 info: Link watchdog-client2:eth0
dead.

hb_standby[4564]:       2008/07/14_09:55:35 Going standby [foreign].

heartbeat[2542]: 2008/07/14_09:55:35 info: watchdog-client1 wants to go
standby [foreign]

heartbeat[2542]: 2008/07/14_09:55:45 WARN: No reply to standby request.
Standby request cancelled.


 

Any idea why it's trying to mount the file system twice? I think that's
what's causing the problem.

Also, after a reboot, shouldn't client1 become the secondary and client2
stay as the primary node?

 

If anyone needs any other information, please let me know. I have been
fighting this problem unsuccessfully for the past 2-3 days.

 

Thanks,

Nikhil

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
