Re: AW: [Linux-HA] Problem with linux-ha and drbd (ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem)

Dominik Klein Thu, 15 Jan 2009 03:12:20 -0800

Sebastian Kösters wrote:
> Does none of you have an idea how to fix this problem?
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Sebastian Kösters
> Sent: Monday, 12 January 2009 23:10
> To: [email protected]
> Subject: [Linux-HA] Problem with linux-ha and drbd (ERROR: Return code 1 from 
> /etc/ha.d/resource.d/Filesystem)
> 
> Hi,
> 
> Today I noticed a problem on my two Heartbeat / DRBD servers.
> 
> Each server has two primary DRBD devices:
> 
> on th-dus-mqm:
> 
> drbd0 / drbd2
> 
> on th-fra-mqm:
> 
> drbd1 / drbd3
> 
> If th-dus-mqm fails, drbd0 and drbd2 fail over to th-fra-mqm. That normally 
> works fine.
> 
> Today I tried to stop heartbeat manually on both servers for testing:
> 
> /etc/init.d/heartbeat stop
> 
> Then I noticed these errors in /var/log/ha-log (on both servers):
> 
> ---
> 
> heartbeat[2834]: 2009/01/12_22:35:08 info: Heartbeat shutdown in progress. 
> (2834)
> heartbeat[4630]: 2009/01/12_22:35:08 info: Giving up all HA resources.
> ResourceManager[4643]:  2009/01/12_22:35:08 info: Releasing resource group: 
> th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0 
> Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2 
> Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
> ResourceManager[4643]:  2009/01/12_22:35:08 info: Running 
> /etc/ha.d/resource.d/mqm_dus  stop


This is something you wrote, I guess? Make sure it returns success for
the stop parameter only after everything has really been stopped
successfully. Read on below.
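
A minimal sketch of what I mean, assuming mqm_dus is a plain shell
script. The helper name wait_until_gone, the check command and the
timeout are made up for illustration; I do not know what your script
actually does.

```shell
#!/bin/sh
# Hypothetical helper for a heartbeat resource script such as mqm_dus.
# Usage: wait_until_gone CHECK_CMD TIMEOUT_SECONDS
# CHECK_CMD must exit 0 while the service is still running.
# Returns 0 once CHECK_CMD fails (service gone), 1 if TIMEOUT_SECONDS
# pass and CHECK_CMD still succeeds.
wait_until_gone() {
    check_cmd=$1
    timeout=$2
    elapsed=0
    while sh -c "$check_cmd" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still running: report failure, not success
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0            # nothing left running
}
```

In the stop branch you would run your real stop command first and then
something like `wait_until_gone "pgrep -u mqm" 30 || exit 1`, where the
user name mqm is only my guess at what identifies your processes.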

> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/Filesystem /dev/drbd2 /home/tbmx/dus ext3 stop
> Filesystem[5005]:       2009/01/12_22:35:09 INFO: Running stop for /dev/drbd2 
> on /home/tbmx/dus
> Filesystem[4994]:       2009/01/12_22:35:09 INFO:  Success
> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/drbddisk drbd2 stop
> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /dus ext3 stop
> Filesystem[5107]:       2009/01/12_22:35:09 INFO: Running stop for /dev/drbd0 
> on /dus
> Filesystem[5096]:       2009/01/12_22:35:09 INFO:  Success
> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/drbddisk drbd0 stop
> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/IPaddr 92.254.37.53 stop
> IPaddr[5200]:   2009/01/12_22:35:09 INFO:  Success
> ResourceManager[4643]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/IPaddr 10.10.121.130 stop
> IPaddr[5258]:   2009/01/12_22:35:09 INFO:  Success
> ResourceManager[5295]:  2009/01/12_22:35:09 info: Releasing resource group: 
> th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1 
> Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3 
> Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
> ResourceManager[5295]:  2009/01/12_22:35:09 info: Running 
> /etc/ha.d/resource.d/mqm_fra  stop
> ResourceManager[5295]:  2009/01/12_22:35:15 info: Running 
> /etc/ha.d/resource.d/Filesystem /dev/drbd3 /home/tbmx/fra ext3 stop
> Filesystem[5553]:       2009/01/12_22:35:15 INFO: Running stop for /dev/drbd3 
> on /home/tbmx/fra
> Filesystem[5553]:       2009/01/12_22:35:15 INFO: Trying to unmount 
> /home/tbmx/fra
> Filesystem[5553]:       2009/01/12_22:35:15 INFO: unmounted /home/tbmx/fra 
> successfully
> Filesystem[5542]:       2009/01/12_22:35:15 INFO:  Success
> ResourceManager[5295]:  2009/01/12_22:35:15 info: Running 
> /etc/ha.d/resource.d/drbddisk drbd3 stop
> ResourceManager[5295]:  2009/01/12_22:35:15 info: Running 
> /etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
> Filesystem[5671]:       2009/01/12_22:35:15 INFO: Running stop for /dev/drbd1 
> on /fra
> Filesystem[5671]:       2009/01/12_22:35:15 INFO: Trying to unmount /fra
> Filesystem[5671]:       2009/01/12_22:35:15 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5671]:       2009/01/12_22:35:15 INFO: Some processes on /fra were 
> signalled
> Filesystem[5671]:       2009/01/12_22:35:16 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5671]:       2009/01/12_22:35:16 INFO: Some processes on /fra were 
> signalled
> Filesystem[5671]:       2009/01/12_22:35:17 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5671]:       2009/01/12_22:35:17 INFO: Some processes on /fra were 
> signalled
> Filesystem[5671]:       2009/01/12_22:35:18 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGKILL
> Filesystem[5671]:       2009/01/12_22:35:18 INFO: Some processes on /fra were 
> signalled
> Filesystem[5671]:       2009/01/12_22:35:19 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGKILL
> Filesystem[5671]:       2009/01/12_22:35:20 INFO: No processes on /fra were 
> signalled
> Filesystem[5671]:       2009/01/12_22:35:21 ERROR: Couldn't unmount /fra, 
> giving up!
> Filesystem[5660]:       2009/01/12_22:35:21 ERROR:  Generic error
> ResourceManager[5295]:  2009/01/12_22:35:21 ERROR: Return code 1 from 
> /etc/ha.d/resource.d/Filesystem

Well, it seems the Filesystem agent could not unmount /fra. This happens,
for example, if some program is still active on the filesystem (holding
open file handles or using it as its working directory). Could that be
the case?
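
To see what is holding the mount point, something along these lines can
be run before stopping heartbeat. It is only a sketch: show_holders is a
name I made up, and it assumes fuser (from psmisc) or lsof is installed
on your nodes.

```shell
#!/bin/sh
# Hypothetical helper: list processes using the filesystem mounted at $1.
# Prefers fuser, falls back to lsof, and returns 127 if neither exists.
show_holders() {
    mountpoint=$1
    if command -v fuser >/dev/null 2>&1; then
        # -m: all processes using the mounted filesystem, -v: verbose table
        fuser -vm "$mountpoint"
    elif command -v lsof >/dev/null 2>&1; then
        # given a mount point, lsof lists open files on that filesystem
        lsof "$mountpoint"
    else
        echo "neither fuser nor lsof found" >&2
        return 127
    fi
}
```

For your case that would be `show_holders /fra` on th-fra-mqm while the
resources are still running.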

> ResourceManager[5295]:  2009/01/12_22:35:22 info: Retrying failed stop 
> operation [Filesystem::/dev/drbd1::/fra::ext3]
> ResourceManager[5295]:  2009/01/12_22:35:22 info: Running 
> /etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
> Filesystem[5839]:       2009/01/12_22:35:22 INFO: Running stop for /dev/drbd1 
> on /fra
> Filesystem[5839]:       2009/01/12_22:35:22 INFO: Trying to unmount /fra
> Filesystem[5839]:       2009/01/12_22:35:22 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5839]:       2009/01/12_22:35:22 INFO: No processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:23 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5839]:       2009/01/12_22:35:23 INFO: No processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:24 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGTERM
> Filesystem[5839]:       2009/01/12_22:35:24 INFO: No processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:25 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGKILL
> Filesystem[5839]:       2009/01/12_22:35:25 INFO: Some processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:26 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGKILL
> Filesystem[5839]:       2009/01/12_22:35:26 INFO: No processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:27 ERROR: Couldn't unmount /fra; 
> trying cleanup with SIGKILL
> Filesystem[5839]:       2009/01/12_22:35:28 INFO: No processes on /fra were 
> signalled
> Filesystem[5839]:       2009/01/12_22:35:29 ERROR: Couldn't unmount /fra, 
> giving up!
> Filesystem[5828]:       2009/01/12_22:35:29 ERROR:  Generic error
> .......
> ResourceManager[5295]:  2009/01/12_22:36:36 ERROR: Return code 1 from 
> /etc/ha.d/resource.d/Filesystem
> Filesystem[9851]:       2009/01/12_22:36:36 INFO:  Running OK

Okay, heartbeat retried and failed again.
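
Note that in the retry the agent mostly reports that no processes on
/fra were signalled, yet the unmount still fails. One thing that may be
worth ruling out (I cannot tell from the log whether it applies here) is
another filesystem mounted below /fra, since that also keeps a mount
point busy. A small sketch, where mounts_below is a hypothetical helper
and the optional second argument exists only so it can be tried against
a copy of the mount table:

```shell
#!/bin/sh
# Hypothetical helper: print mount points located below directory $1.
# $2 is the mount table to read (defaults to /proc/mounts on Linux).
mounts_below() {
    dir=$1
    table=${2:-/proc/mounts}
    # Field 2 of the mount table is the mount point; match "<dir>/..." only,
    # so /fra itself and unrelated paths like /france are not printed.
    awk -v d="$dir" 'index($2, d "/") == 1 { print $2 }' "$table"
}
```

If `mounts_below /fra` prints anything, that mount would have to be
released before /fra can be unmounted.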

> ResourceManager[5295]:  2009/01/12_22:36:36 CRIT: Resource STOP failure. 
> Reboot required!
> ResourceManager[5295]:  2009/01/12_22:36:36 CRIT: Killing heartbeat 
> ungracefully! 

So it reboots to clean things up.

Regards
Dominik

> ---
> 
> after that the server does a reboot. After the reboot everything is working 
> fine again
> 
> I don't know why it is not able to unmount the device correctly. Sometimes I 
> can stop heartbeat without errors and sometimes not.
> 
> my haresources file:
> 
> ---
> 
> th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0 
> Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2 
> Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
> th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1 
> Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3 
> Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
> 
> ---
> 
> my ha.cf:
> 
> ---
> 
> node th-dus-mqm th-fra-mqm
> ucast bond0.121 10.10.121.132
> ucast bond0.121 10.10.121.133
> auto_failback off
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> warntime 3
> deadtime 6
> initdead 60
> keepalive 2
> 
> ---
> 
> my drbd.conf:
> 
> ---
> 
> resource drbd0 {
>   protocol C;
>   startup {
>     become-primary-on th-dus-mqm;
>   }
>   syncer {
>     rate 50M;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on th-dus-mqm {
>     device     /dev/drbd0;
>     disk       /dev/sda10;
>     address    10.10.121.132:7766;
>     meta-disk  internal;
>   }
>   on th-fra-mqm {
>     device    /dev/drbd0;
>     disk      /dev/sda10;
>     address   10.10.121.133:7766;
>     meta-disk internal;
>   }
> }
> resource drbd1 {
>   protocol C;
>   startup {
>     become-primary-on th-fra-mqm;
>   }
>   syncer {
>     rate 50M;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on th-dus-mqm {
>     device     /dev/drbd1;
>     disk       /dev/sda11;
>     address    10.10.121.132:7776;
>     meta-disk  internal;
>   }
>   on th-fra-mqm {
>     device    /dev/drbd1;
>     disk      /dev/sda11;
>     address   10.10.121.133:7776;
>     meta-disk internal;
>   }
> }
> resource drbd2 {
>   protocol C;
>   startup {
>     become-primary-on th-dus-mqm;
>   }
>   syncer {
>     rate 50M;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on th-dus-mqm {
>     device     /dev/drbd2;
>     disk       /dev/sda12;
>     address    10.10.121.132:7786;
>     meta-disk  internal;
>   }
>   on th-fra-mqm {
>     device    /dev/drbd2;
>     disk      /dev/sda12;
>     address   10.10.121.133:7786;
>     meta-disk internal;
>   }
> }
> resource drbd3 {
>   protocol C;
>   startup {
>     become-primary-on th-fra-mqm;
>   }
>   syncer {
>     rate 50M;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on th-dus-mqm {
>     device     /dev/drbd3;
>     disk       /dev/sda13;
>     address    10.10.121.132:7796;
>     meta-disk  internal;
>   }
>   on th-fra-mqm {
>     device    /dev/drbd3;
>     disk      /dev/sda13;
>     address   10.10.121.133:7796;
>     meta-disk internal;
>   }
> }
> 
> ---
> 
> I hope you guys can help me with my problem.
> 
> Thanks in advance.
> 
> Kind regards
> Sebastian
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
