Sebastian Kösters wrote:
> Does none of you have an idea how to fix this problem?
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sebastian Kösters
> Sent: Monday, 12 January 2009 23:10
> To: [email protected]
> Subject: [Linux-HA] Problem with linux-ha and drbd (ERROR: Return code 1 from
> /etc/ha.d/resource.d/Filesystem)
>
> Hi,
>
> Today I noticed a problem on my two Heartbeat/DRBD servers.
>
> Each server has two primary DRBD devices:
>
> on th-dus-mqm:
>
> drbd0 / drbd2
>
> on th-fra-mqm:
>
> drbd1 / drbd3
>
> If th-dus-mqm fails, drbd0 and drbd2 fail over to th-fra-mqm. That normally
> works fine.
>
> Today I tried stopping heartbeat manually on both servers as a test:
>
> /etc/init.d/heartbeat stop
>
> Then I noticed these errors in /var/log/ha-log (on both servers):
>
> ---
>
> heartbeat[2834]: 2009/01/12_22:35:08 info: Heartbeat shutdown in progress.
> (2834)
> heartbeat[4630]: 2009/01/12_22:35:08 info: Giving up all HA resources.
> ResourceManager[4643]: 2009/01/12_22:35:08 info: Releasing resource group:
> th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0
> Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2
> Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
> ResourceManager[4643]: 2009/01/12_22:35:08 info: Running
> /etc/ha.d/resource.d/mqm_dus stop
This is a script you wrote yourself, I guess? Make sure that, when called
with the stop parameter, it returns success only after everything has
really been stopped. Read on below.
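To illustrate what I mean by "really stopped", here is a minimal sketch in
Python. It is not your mqm_dus script: request_stop and is_running are
hypothetical callables standing in for however your application is shut down
and however you check that none of its processes are left. The point is only
that the exit code is 0 once the service is verifiably gone, and 1 otherwise.

```python
import time


def wait_until_stopped(is_running, timeout=30.0, interval=0.5):
    """Poll is_running() until it returns False or the timeout expires.

    is_running is a hypothetical callable supplied by the caller; it should
    return True as long as any process of the service is still alive.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False  # still running after the timeout
        time.sleep(interval)
    return True


def stop_exit_code(request_stop, is_running, timeout=30.0, interval=0.5):
    """Return 0 only if the service is verifiably down, 1 otherwise.

    request_stop is a hypothetical callable that asks the service to shut
    down; it is skipped when the service is already stopped.
    """
    if is_running():
        request_stop()
    return 0 if wait_until_stopped(is_running, timeout, interval) else 1
```

A stop handler built like this would pass the result to sys.exit(), so
heartbeat only moves on to unmounting the filesystems once nothing from the
application is left running on them.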
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd2 /home/tbmx/dus ext3 stop
> Filesystem[5005]: 2009/01/12_22:35:09 INFO: Running stop for /dev/drbd2
> on /home/tbmx/dus
> Filesystem[4994]: 2009/01/12_22:35:09 INFO: Success
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/drbddisk drbd2 stop
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /dus ext3 stop
> Filesystem[5107]: 2009/01/12_22:35:09 INFO: Running stop for /dev/drbd0
> on /dus
> Filesystem[5096]: 2009/01/12_22:35:09 INFO: Success
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/drbddisk drbd0 stop
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/IPaddr 92.254.37.53 stop
> IPaddr[5200]: 2009/01/12_22:35:09 INFO: Success
> ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/IPaddr 10.10.121.130 stop
> IPaddr[5258]: 2009/01/12_22:35:09 INFO: Success
> ResourceManager[5295]: 2009/01/12_22:35:09 info: Releasing resource group:
> th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1
> Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3
> Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
> ResourceManager[5295]: 2009/01/12_22:35:09 info: Running
> /etc/ha.d/resource.d/mqm_fra stop
> ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd3 /home/tbmx/fra ext3 stop
> Filesystem[5553]: 2009/01/12_22:35:15 INFO: Running stop for /dev/drbd3
> on /home/tbmx/fra
> Filesystem[5553]: 2009/01/12_22:35:15 INFO: Trying to unmount
> /home/tbmx/fra
> Filesystem[5553]: 2009/01/12_22:35:15 INFO: unmounted /home/tbmx/fra
> successfully
> Filesystem[5542]: 2009/01/12_22:35:15 INFO: Success
> ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
> /etc/ha.d/resource.d/drbddisk drbd3 stop
> ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
> Filesystem[5671]: 2009/01/12_22:35:15 INFO: Running stop for /dev/drbd1
> on /fra
> Filesystem[5671]: 2009/01/12_22:35:15 INFO: Trying to unmount /fra
> Filesystem[5671]: 2009/01/12_22:35:15 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5671]: 2009/01/12_22:35:15 INFO: Some processes on /fra were
> signalled
> Filesystem[5671]: 2009/01/12_22:35:16 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5671]: 2009/01/12_22:35:16 INFO: Some processes on /fra were
> signalled
> Filesystem[5671]: 2009/01/12_22:35:17 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5671]: 2009/01/12_22:35:17 INFO: Some processes on /fra were
> signalled
> Filesystem[5671]: 2009/01/12_22:35:18 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGKILL
> Filesystem[5671]: 2009/01/12_22:35:18 INFO: Some processes on /fra were
> signalled
> Filesystem[5671]: 2009/01/12_22:35:19 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGKILL
> Filesystem[5671]: 2009/01/12_22:35:20 INFO: No processes on /fra were
> signalled
> Filesystem[5671]: 2009/01/12_22:35:21 ERROR: Couldn't unmount /fra,
> giving up!
> Filesystem[5660]: 2009/01/12_22:35:21 ERROR: Generic error
> ResourceManager[5295]: 2009/01/12_22:35:21 ERROR: Return code 1 from
> /etc/ha.d/resource.d/Filesystem
Well, it seems the Filesystem agent could not unmount /fra. This happens,
for example, when some program is still active on the device (holding
open file handles). Could that be the case?
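To check, you can run "fuser -vm /fra" or "lsof /fra" right after mqm_fra has
been stopped. If you would rather log the holders from a script, the rough
sketch below does something similar by reading the cwd and fd links under
/proc. The function names are mine, and it is only an approximation: it needs
root to see other users' processes, and it will not show memory-mapped files
or holders that are not ordinary processes.

```python
import os


def is_under(path, mount_point):
    """True if path is the mount point itself or lies below it."""
    mount_point = mount_point.rstrip("/") or "/"
    if mount_point == "/":
        return path.startswith("/")
    return path == mount_point or path.startswith(mount_point + "/")


def holders(mount_point, proc_root="/proc"):
    """Map pid -> sorted paths that the process keeps open under mount_point.

    Looks at each process's working directory and open file descriptors.
    """
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process vanished or permission denied
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if is_under(target, mount_point):
                found.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in found.items()}
```

Calling holders("/fra") at the end of the stop script and writing the result
to a log file should show whether something from the application survives the
stop on the occasions where the unmount fails.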
> ResourceManager[5295]: 2009/01/12_22:35:22 info: Retrying failed stop
> operation [Filesystem::/dev/drbd1::/fra::ext3]
> ResourceManager[5295]: 2009/01/12_22:35:22 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
> Filesystem[5839]: 2009/01/12_22:35:22 INFO: Running stop for /dev/drbd1
> on /fra
> Filesystem[5839]: 2009/01/12_22:35:22 INFO: Trying to unmount /fra
> Filesystem[5839]: 2009/01/12_22:35:22 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5839]: 2009/01/12_22:35:22 INFO: No processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:23 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5839]: 2009/01/12_22:35:23 INFO: No processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:24 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGTERM
> Filesystem[5839]: 2009/01/12_22:35:24 INFO: No processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:25 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGKILL
> Filesystem[5839]: 2009/01/12_22:35:25 INFO: Some processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:26 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGKILL
> Filesystem[5839]: 2009/01/12_22:35:26 INFO: No processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:27 ERROR: Couldn't unmount /fra;
> trying cleanup with SIGKILL
> Filesystem[5839]: 2009/01/12_22:35:28 INFO: No processes on /fra were
> signalled
> Filesystem[5839]: 2009/01/12_22:35:29 ERROR: Couldn't unmount /fra,
> giving up!
> Filesystem[5828]: 2009/01/12_22:35:29 ERROR: Generic error
> .......
> ResourceManager[5295]: 2009/01/12_22:36:36 ERROR: Return code 1 from
> /etc/ha.d/resource.d/Filesystem
> Filesystem[9851]: 2009/01/12_22:36:36 INFO: Running OK
Okay, heartbeat retried and failed again.
> ResourceManager[5295]: 2009/01/12_22:36:36 CRIT: Resource STOP failure.
> Reboot required!
> ResourceManager[5295]: 2009/01/12_22:36:36 CRIT: Killing heartbeat
> ungracefully!
So it reboots to clean things up.
Regards
Dominik
> ---
>
> After that the server reboots. After the reboot everything works
> fine again.
>
> I don't know why it cannot unmount the device correctly. Sometimes I can
> stop heartbeat without errors and sometimes not.
>
> my haresources file:
>
> ---
>
> th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0
> Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2
> Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
> th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1
> Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3
> Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
>
> ---
>
> my ha.cf:
>
> ---
>
> node th-dus-mqm th-fra-mqm
> ucast bond0.121 10.10.121.132
> ucast bond0.121 10.10.121.133
> auto_failback off
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> warntime 3
> deadtime 6
> initdead 60
> keepalive 2
>
> ---
>
> my drbd.conf:
>
> ---
>
> resource drbd0 {
> protocol C;
> startup {
> become-primary-on th-dus-mqm;
> }
> syncer {
> rate 50M;
> }
> net {
> allow-two-primaries;
> }
> on th-dus-mqm {
> device /dev/drbd0;
> disk /dev/sda10;
> address 10.10.121.132:7766;
> meta-disk internal;
> }
> on th-fra-mqm {
> device /dev/drbd0;
> disk /dev/sda10;
> address 10.10.121.133:7766;
> meta-disk internal;
> }
> }
> resource drbd1 {
> protocol C;
> startup {
> become-primary-on th-fra-mqm;
> }
> syncer {
> rate 50M;
> }
> net {
> allow-two-primaries;
> }
> on th-dus-mqm {
> device /dev/drbd1;
> disk /dev/sda11;
> address 10.10.121.132:7776;
> meta-disk internal;
> }
> on th-fra-mqm {
> device /dev/drbd1;
> disk /dev/sda11;
> address 10.10.121.133:7776;
> meta-disk internal;
> }
> }
> resource drbd2 {
> protocol C;
> startup {
> become-primary-on th-dus-mqm;
> }
> syncer {
> rate 50M;
> }
> net {
> allow-two-primaries;
> }
> on th-dus-mqm {
> device /dev/drbd2;
> disk /dev/sda12;
> address 10.10.121.132:7786;
> meta-disk internal;
> }
> on th-fra-mqm {
> device /dev/drbd2;
> disk /dev/sda12;
> address 10.10.121.133:7786;
> meta-disk internal;
> }
> }
> resource drbd3 {
> protocol C;
> startup {
> become-primary-on th-fra-mqm;
> }
> syncer {
> rate 50M;
> }
> net {
> allow-two-primaries;
> }
> on th-dus-mqm {
> device /dev/drbd3;
> disk /dev/sda13;
> address 10.10.121.132:7796;
> meta-disk internal;
> }
> on th-fra-mqm {
> device /dev/drbd3;
> disk /dev/sda13;
> address 10.10.121.133:7796;
> meta-disk internal;
> }
> }
>
> ---
>
> I hope you guys can help me with my problem.
>
> Thanks in advance.
>
> Kind regards
> Sebastian
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems